
Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.

Note · 2026-02-22 · sourced from RLVR
How should researchers navigate LLM reasoning research? Do reasoning traces show how models actually think? What does reward learning actually do to model reasoning?

RLVR verifies only the final answer and distributes rewards uniformly across all tokens. Its impact on intermediate reasoning tokens — which are not directly incentivized — has not been formally studied. Using a First-Order Logic (FOL)-based error taxonomy to classify errors in intermediate steps, the investigation reveals a nuanced picture.
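The credit-assignment pattern described above can be sketched in a few lines. This is our own minimal illustration of outcome-only reward spreading (function and variable names are assumptions, not code from the source):

```python
# Minimal sketch (our illustration, not the source's code) of the
# credit-assignment pattern described above: a binary reward from
# final-answer verification is spread uniformly across every token,
# so intermediate reasoning tokens receive no step-specific signal.

def token_advantages(trace_tokens, final_answer, gold_answer):
    reward = 1.0 if final_answer == gold_answer else 0.0
    # Every token -- including intermediate reasoning steps -- gets the
    # same advantage; there is no token-level credit assignment.
    return [reward] * len(trace_tokens)
```

Under this scheme, a trace that reaches the right answer reinforces every one of its tokens equally, including any flawed intermediate steps it happens to contain.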

RLVR post-training does improve trace coherence — the local consistency of reasoning steps as measured by error patterns. The improvement is strongest on problems where the base model fails but the RL-trained model succeeds. Reasoning traces become more internally consistent, with fewer identifiable logical errors between adjacent steps.

However, trace coherence is not trace validity. Coherence measures local consistency — each step follows plausibly from the previous one. Validity implies global logical soundness — the entire chain constitutes a correct mathematical proof. Coherent traces can be globally invalid: a chain of locally plausible steps can still reach a wrong conclusion or contain a valid-seeming path that skips essential justification.
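A toy arithmetic example (our construction, not from the source) makes the gap concrete: a checker that only inspects each step's shown work will pass a trace whose opening step is a bare, unjustified assertion, even though the chain as a whole is invalid:

```python
# Toy illustration: a trace is a list of (expression, claimed_value)
# steps, where "x" in an expression refers to the previous claim and
# expression=None marks a bare assertion with no shown work.

def locally_coherent(trace):
    """Does each step's SHOWN work compute correctly from the previous claim?"""
    prev = None
    for expr, claimed in trace:
        if expr is not None:  # bare assertions offer nothing to check locally
            shown = expr.replace("x", str(prev)) if prev is not None else expr
            if eval(shown) != claimed:  # eval is safe here: trusted toy input only
                return False
        prev = claimed
    return True

def globally_valid(trace, ground_truth):
    """Does the chain as a whole reach the correct result?"""
    return trace[-1][1] == ground_truth

# Problem: compute 3*(4+5) = 27.
good = [("4+5", 9), ("3*x", 27)]   # coherent and valid
bad = [(None, 10), ("3*x", 30)]    # step 1 skips its justification; step 2 follows it faithfully
```

The `bad` trace passes the local check (its only shown computation, 3*10 = 30, is correct given the asserted premise) yet fails the global one, which is exactly the coherence-without-validity failure mode described above.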

This finding extends the pattern from "What do models actually learn from chain-of-thought training?": RLVR, like long CoT training, optimizes for structural properties (local coherence) rather than semantic properties (global validity). The reward signal from final-answer verification creates pressure toward "traces that look right" rather than "traces that are right." The uniform distribution of advantages across tokens means the model has no mechanism to specifically improve at the critical reasoning junctures.

As in "Does chain-of-thought reasoning reveal genuine inference or pattern matching?", the coherence-validity gap is the RLVR-specific manifestation of the broader CoT-as-imitation pattern. The model learns the form of coherent reasoning (adjacent steps that fit together) without necessarily learning the substance (valid logical derivation).

The coherence-validity distinction maps directly onto the faithfulness framework of "Do language models actually use their reasoning steps?": RLVR's coherence improvement addresses neither faithfulness criterion. Improved local coherence means adjacent steps follow plausibly from each other (a structural property), but it does not establish that those steps are causally sufficient (removing them would degrade the answer) or causally necessary (no spurious steps are present). RLVR-improved traces may look more faithful while being no more causally grounded: the structural surface improves while the causal substance remains unverified.
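The causal tests referenced here can be sketched as a step-ablation probe. Everything below is a hypothetical illustration: `answer_given` stands in for a query to the model, and the toy stand-in at the bottom is ours:

```python
# Hypothetical sketch of a step-ablation probe for causal relevance:
# a step matters causally if deleting it changes the final answer.
# `answer_given` stands in for a call to the model (an assumption).

def causally_relevant_steps(steps, answer_given):
    baseline = answer_given(steps)
    relevant = []
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]  # drop step i
        if answer_given(ablated) != baseline:
            relevant.append(i)  # the answer changed: step i was load-bearing
    return relevant

# Toy stand-in for a model: it only answers correctly when the key
# computation survives in the trace.
def toy_answer(steps):
    return "27" if "4+5=9" in steps else "unknown"
```

A trace can score well on local coherence while this probe finds few or no causally relevant steps, which is precisely why coherence gains alone do not demonstrate faithfulness.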

Claims that RLVR "improves reasoning" should be examined carefully: what improves is trace coherence (perceived quality), not necessarily trace validity (actual mathematical correctness).


Source: RLVR

RLVR improves trace coherence without guaranteeing trace validity — local consistency gains should not be mistaken for improved mathematical reasoning.