Does RLVR actually improve mathematical reasoning or just coherence?
RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.
RLVR verifies only the final answer and distributes the resulting reward uniformly across all tokens. Its impact on intermediate reasoning tokens, which are not directly incentivized, has not been formally studied. Classifying intermediate-step errors with a First-Order Logic (FOL) based taxonomy reveals a nuanced picture.
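To make the credit-assignment point concrete, here is a minimal sketch of how a verifiable-reward trainer typically spreads a final-answer reward over tokens. This is an illustrative reconstruction, not the paper's code: `verify_answer` and the rollout structure are assumptions, and the group-mean baseline follows the common GRPO-style recipe.

```python
import torch

def token_advantages(rollouts, verify_answer):
    """Per-token advantages from a final-answer-only reward.

    rollouts: list of dicts with 'token_ids' (list of ints) and 'answer' (str)
    verify_answer: returns 1.0 if the final answer checks out, else 0.0
    """
    # The reward depends only on the verified final answer; no
    # intermediate reasoning token is scored on its own merits.
    rewards = torch.tensor([verify_answer(r["answer"]) for r in rollouts])

    # Group-normalized baseline (GRPO-style): center on the group mean.
    advantages = rewards - rewards.mean()

    # The key property: every token in a rollout gets the SAME advantage.
    # A flawed middle step and a sound final step are credited identically,
    # so nothing in the gradient targets critical reasoning junctures.
    return [torch.full((len(r["token_ids"]),), float(adv))
            for adv, r in zip(advantages, rollouts)]
```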
RLVR post-training does improve trace coherence — the local consistency of reasoning steps as measured by error patterns. The improvement is strongest on problems where the base model fails but the RL-trained model succeeds. Reasoning traces become more internally consistent, with fewer identifiable logical errors between adjacent steps.
However, trace coherence is not trace validity. Coherence measures local consistency: each step follows plausibly from the one before. Validity requires global logical soundness: the entire chain constitutes a correct mathematical proof. Coherent traces can be globally invalid: a chain of locally plausible steps can still reach a wrong conclusion, or look sound while skipping an essential justification.
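The gap can be stated precisely. In the toy sketch below (hypothetical checker interfaces, invented here for illustration), coherence is a conjunction of pairwise checks while validity is a property of the whole chain, so the first can hold while the second fails.

```python
from typing import Callable, Sequence

def is_coherent(steps: Sequence[str],
                step_follows: Callable[[str, str], bool]) -> bool:
    """Local coherence: every adjacent pair of steps is consistent.

    Purely pairwise by construction: it never sees the chain as a whole,
    so it cannot notice a skipped justification or a slow drift toward
    a wrong conclusion.
    """
    return all(step_follows(a, b) for a, b in zip(steps, steps[1:]))

def is_valid(steps: Sequence[str], goal: str,
             proof_checker: Callable[[Sequence[str], str], bool]) -> bool:
    """Global validity: the whole chain is a sound derivation of the goal.

    Strictly stronger than coherence: validity implies every pairwise
    check passes, but a trace can pass is_coherent and still fail here.
    """
    return proof_checker(steps, goal)
```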
This finding extends the pattern from "What do models actually learn from chain-of-thought training?": RLVR, like long CoT training, optimizes for structural properties (local coherence) rather than semantic properties (global validity). The reward signal from final-answer verification creates pressure toward traces that look right rather than traces that are right. And because advantages are spread uniformly across tokens (as in the sketch above), the model receives no signal that specifically targets the critical reasoning junctures.
Building on "Does chain-of-thought reasoning reveal genuine inference or pattern matching?", the coherence-validity gap is the RLVR-specific manifestation of the broader CoT-as-imitation pattern: the model learns the form of coherent reasoning (adjacent steps that fit together) without necessarily learning the substance (valid logical derivation).
The coherence-validity distinction maps directly onto the faithfulness framework of "Do language models actually use their reasoning steps?", and RLVR's coherence improvement addresses neither of its criteria. Improved local coherence means adjacent steps follow plausibly from each other (a structural property), but it does not establish that those steps are causally sufficient (removing them would degrade the answer) or causally necessary (no spurious steps are present). RLVR-improved traces may look more faithful while being no more causally grounded: the structural surface improves while the causal substance remains unverified.
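A minimal sketch of what testing those criteria would involve, assuming a hypothetical model.answer(question, reasoning) interface (not a real API) and following the sufficiency/necessity framing above:

```python
def ablate_steps(model, question, steps, original_answer):
    """Single-step ablation probe for causal grounding.

    For each reasoning step, re-query the model with that step removed
    and see whether the answer changes. Coherence metrics never run
    this kind of intervention, which is why they cannot certify it.
    """
    load_bearing, spurious = [], []
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]           # drop step i
        new_answer = model.answer(question, ablated)  # hypothetical API
        if new_answer != original_answer:
            # Removing the step degrades the answer: evidence the
            # step is causally load-bearing (sufficiency criterion).
            load_bearing.append(i)
        else:
            # The answer survives without the step: evidence the step
            # is spurious (necessity criterion violated).
            spurious.append(i)
    return load_bearing, spurious
```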
Claims that RLVR "improves reasoning" should be examined carefully: what improves is trace coherence (perceived quality), not necessarily trace validity (actual mathematical correctness).
Source: RLVR
Related concepts in this collection
- What do models actually learn from chain-of-thought training?
  When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.
  Link: RLVR's coherence improvement is the same dynamic, structural over semantic.
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  Link: the coherence-validity gap is the RLVR-specific form of CoT-as-imitation.
- Do reasoning traces actually cause correct answers?
  Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
  Link: coherent traces invite anthropomorphic trust.
- Do chain of thought traces actually help humans understand reasoning?
  When models show their work through chain of thought traces, do humans find them interpretable? Research tested whether the traces that improve model performance also improve human understanding.
  Link: coherence optimizes perceived quality, not actual validity.
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  Link: RLVR's coherence gains address neither faithfulness criterion; local plausibility does not establish causal sufficiency or necessity of reasoning steps.
- Does fine-tuning weaken how reasoning steps influence answers?
  When models are fine-tuned on domain-specific tasks, do their chain-of-thought reasoning steps actually causally drive the final answer, or do they become decorative? This matters because accurate outputs can mask unfaithful reasoning.
  Link: a parallel phenomenon. SFT degrades faithfulness (reasoning steps become causally disconnected from answers) while RLVR improves coherence without validity; both show that training can improve surface reasoning quality while leaving the causal grounding problem intact or worse.
Original note title: rlvr improves trace coherence without guaranteeing trace validity — local consistency gains should not be mistaken for improved mathematical reasoning