Can training improve reasoning coherence without improving actual correctness?
This explores whether training can make a model's reasoning *look* more coherent — fluent, well-formed, confident chains of thought — without that polish translating into more correct answers, and the corpus shows the gap runs in both directions.
Surface coherence and actual correctness can move independently under training, and the collection's most striking finding is that they routinely do. The cleanest demonstration is the inverse of the question: supervised fine-tuning can raise benchmark accuracy while *degrading* reasoning quality, cutting the information gain of each step by nearly 39% (Does supervised fine-tuning improve reasoning or just answers?). The model learns to land on the right answer through post-hoc rationalization rather than genuine inference, and standard metrics never notice because they only check the final answer. So coherence and correctness aren't just separable: optimizing one can quietly corrode the other.
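To make the blind spot concrete, here is a minimal Python sketch of how a final-answer metric can rise while a step-level metric falls. Everything in it is an illustrative assumption: the `Trace` type, the per-step scores, and the toy values are invented for the example, and `mean_step_score` is a stand-in for a step-quality measure, not the Information Gain metric used in the cited note.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    steps: list[float]  # hypothetical per-step usefulness scores in [0, 1]
    answer: str         # the final answer the trace lands on


def final_answer_accuracy(traces, gold):
    """Fraction of traces whose final answer matches the gold answer."""
    return sum(t.answer == g for t, g in zip(traces, gold)) / len(gold)


def mean_step_score(traces):
    """Average of the (hypothetical) per-step scores across all traces."""
    scores = [s for t in traces for s in t.steps]
    return sum(scores) / len(scores)


gold = ["42", "7"]

# Toy values, made up for illustration only.
before = [Trace([0.8, 0.7, 0.9], "42"), Trace([0.6, 0.8], "9")]
after = [Trace([0.3, 0.2, 0.4], "42"), Trace([0.1, 0.3], "7")]

print("accuracy:", final_answer_accuracy(before, gold), "->",
      final_answer_accuracy(after, gold))
print(f"step score: {mean_step_score(before):.2f} -> {mean_step_score(after):.2f}")
```

A benchmark that reports only the first line would record an improvement; the second line is the part it never sees.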
A cluster of papers attacks the assumption that coherent-looking reasoning is doing any real work at all. Models trained on deliberately *corrupted*, irrelevant reasoning traces perform comparably to those trained on correct ones, and sometimes generalize better (Do reasoning traces need to be semantically correct?). Chain-of-thought prompts with logically invalid steps match valid ones on hard benchmarks (Does logical validity actually drive chain-of-thought gains?). The shared explanation: chain-of-thought is constrained imitation of the *form* of reasoning, reproducing familiar schemata from training rather than performing symbolic inference, which is why it breaks down predictably under distribution shift (Does chain-of-thought reasoning reveal genuine inference or pattern matching?). In other words, the trace is computational scaffolding that improves accuracy whether or not its content is valid. Coherence is partly theater.
The decoupling shows up as a failure mode too, not just an efficiency curiosity. Reasoning-trained models show no real resistance to sycophantic pressure: better reasoning training doesn't make a model harder to talk out of a correct answer, because sycophancy is a property of the generation distribution, not the reasoning process (Can better reasoning training actually reduce model sycophancy?). And more apparent deliberation isn't free: accuracy peaks and then *declines* as thinking tokens grow, with models overthinking easy problems into wrong answers (Does more thinking time always improve reasoning accuracy?). Longer, more elaborate chains can look more thorough while being less correct.
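For a sense of scale, the source notes at the end of this page report accuracy of 87.3% at roughly 1,100 thinking tokens and 70.3% at roughly 16K. A short Python calculation, using only those two figures, gives the absolute and relative size of the drop; the variable names are just labels chosen for this sketch.

```python
short_acc = 87.3  # accuracy (%) at ~1,100 thinking tokens, as reported in the source notes
long_acc = 70.3   # accuracy (%) at ~16K thinking tokens, as reported in the source notes

absolute_drop = short_acc - long_acc               # in percentage points
relative_drop = absolute_drop / short_acc * 100    # as a share of the starting accuracy

print(f"{absolute_drop:.1f} points, {relative_drop:.1f}% relative")
# prints "17.0 points, 19.5% relative"
```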
The more hopeful counterpoint is that some training genuinely lifts correctness rather than just polish, but notice *how*. Backward-reasoning training improves forward accuracy by forcing the model to internalize consistency between problem and solution, a mechanism aimed at understanding rather than fluent output (Can backward reasoning during training improve forward reasoning?). RLVR (reinforcement learning with verifiable rewards) concentrates its real learning signal on the ~20% of tokens that are high-entropy 'forking' points where decisions actually get made, not on the connective prose (Do high-entropy tokens drive reasoning model improvements?). And grounding reasoning in external feedback, interleaving steps with real tool queries, cuts errors precisely because correctness gets checked against the world instead of against the chain's own internal coherence (Can interleaving reasoning with real-world feedback prevent hallucination?).
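The idea of 'forking' tokens can be sketched in a few lines of Python: compute the Shannon entropy of the model's next-token distribution at each position, keep the highest-entropy fraction, and restrict the loss to those positions. This is a toy illustration under stated assumptions, not the cited work's procedure: the function names, the per-sequence top-20% selection rule, and the example distributions and losses are all hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def forking_mask(distributions, fraction=0.2):
    """Mark the top `fraction` of positions by entropy (assumed selection rule)."""
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]


def masked_mean(losses, mask):
    """Average the per-token losses over the selected positions only."""
    selected = [loss for loss, m in zip(losses, mask) if m]
    return sum(selected) / len(selected)


# Toy next-token distributions over a 4-token vocabulary (made up for illustration).
dists = [
    [0.97, 0.01, 0.01, 0.01],      # near-certain: connective prose
    [0.25, 0.25, 0.25, 0.25],      # maximally uncertain: a decision point
    [0.90, 0.05, 0.03, 0.02],
    [0.99, 0.005, 0.003, 0.002],
    [0.85, 0.05, 0.05, 0.05],
]
losses = [0.1, 1.2, 0.3, 0.05, 0.4]  # hypothetical per-token losses

mask = forking_mask(dists)
print(mask)
print(masked_mean(losses, mask))
```

With five positions and a 20% fraction, only the uniform distribution at index 1 is selected, so the masked loss ignores the low-entropy connective tokens entirely.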
The through-line worth taking away: much of what reads as 'better reasoning' after training is selection of fluent form, and base models already carry latent reasoning that minimal training merely elicits rather than creates (Do base models already contain hidden reasoning ability?). So yes: training can absolutely buff coherence without buying correctness, and can even buy correctness while gutting coherence. The two only move together when the training signal is tied to something external and verifiable rather than to the look of the reasoning itself.
Sources (10 notes)
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Reasoning-optimized models show no meaningful advantage over base models in resisting sycophantic pressure. The LOGICOM benchmark found that GPT-4 was erroneously convinced 69% more often by fallacious arguments than by logically sound ones, suggesting sycophancy is a generation-distribution problem, not a reasoning problem.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Training models simultaneously on forward reasoning, backward question generation, and backward reasoning improves forward-only performance by an average of 13.53% across 12 datasets. The mechanism: generating backward questions forces models to understand the inverse relationship between problem and solution, deepening understanding that transfers to forward reasoning without test-time overhead.
Only ~20% of tokens exhibit high entropy and act as pivotal decision points in reasoning; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that this minority carries the learning signal.
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.