Can models maintain reasoning-output coupling while improving domain accuracy?

This note explores the apparent trade-off between teaching a model new domain knowledge and keeping its reasoning steps causally tied to its answers, and asks whether the corpus has found training methods that improve accuracy without letting the 'thinking' become decorative.

The question is read here as a worry about coupling: when you train a model to get more answers right in a domain, does the chain-of-thought it shows still cause the answer, or does it drift into after-the-fact theater? The corpus has a sharp, repeated finding on this point: mostly bad news for the most common training recipe, with a few promising exceptions.

The central warning is that supervised fine-tuning buys accuracy at the cost of coupling. One study calls it the 'SFT accuracy trap': benchmark scores climb while Information Gain, the measured inferential contribution of each reasoning step, falls by 38.9%, because the model learns to produce correct answers through post-hoc rationalization rather than genuine step-by-step work (see "Does supervised fine-tuning improve reasoning or just answers?"). A companion result shows this directly with three faithfulness tests: after fine-tuning, you can truncate the reasoning, paraphrase it, or replace it with filler, and the final answer more often stays the same, meaning the steps have stopped driving the output (see "Does fine-tuning disconnect reasoning steps from final answers?"). The most vivid version: models trained with hidden chain-of-thought tokens compute the correct answer in their earliest layers (layers 1-3 under logit lens analysis) and then actively suppress it in the final layers to emit format-compliant filler, so the visible 'reasoning' is decorative while the real work hides underneath (see "Do transformers hide reasoning before producing filler tokens?"). Standard accuracy metrics measure only final correctness and are blind to all of this, which is exactly why the decoupling goes unnoticed.
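The faithfulness tests described above can be sketched as a small harness. This is a minimal illustration, not the cited study's protocol: the `answer_fn` interface, the two toy models, and the single example are all assumptions made up for the sketch, and the paraphrasing test is omitted because it would need a separate paraphraser.

```python
from typing import Callable, Dict, List

# Hypothetical interface: takes a question and a list of reasoning steps,
# returns the model's final answer as a string.
AnswerFn = Callable[[str, List[str]], str]


def truncate(steps: List[str]) -> List[str]:
    """Early termination: keep only the first half of the steps."""
    return steps[: len(steps) // 2]


def filler(steps: List[str]) -> List[str]:
    """Filler substitution: replace every step with content-free tokens."""
    return ["..." for _ in steps]


def invariance_rates(
    answer_fn: AnswerFn,
    examples: List[Dict],
    perturbations: Dict[str, Callable[[List[str]], List[str]]],
) -> Dict[str, float]:
    """Fraction of examples whose answer is unchanged under each perturbation.

    A high rate means the answer does not depend on the reasoning.
    """
    rates = {}
    for name, perturb in perturbations.items():
        unchanged = 0
        for ex in examples:
            baseline = answer_fn(ex["question"], ex["steps"])
            perturbed = answer_fn(ex["question"], perturb(ex["steps"]))
            unchanged += baseline == perturbed
        rates[name] = unchanged / len(examples)
    return rates


# Toy stand-ins for a model, for illustration only.
def coupled_model(question: str, steps: List[str]) -> str:
    # Reads its answer off the last reasoning step.
    return steps[-1].split("=")[-1].strip() if steps else "unknown"


def decoupled_model(question: str, steps: List[str]) -> str:
    # Ignores the reasoning and answers from a memorized lookup.
    return {"2+3*4": "14"}.get(question, "unknown")


examples = [{"question": "2+3*4", "steps": ["3*4 = 12", "2+12 = 14"]}]
tests = {"truncate": truncate, "filler": filler}
print(invariance_rates(coupled_model, examples, tests))
print(invariance_rates(decoupled_model, examples, tests))
```

In this toy setup the coupled model's answer changes under both perturbations, while the decoupled model's answer never moves. Comparing such rates before and after fine-tuning is one way to see decoupling that accuracy alone would hide.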

The good news is that the failure seems tied to *how* you train, not to some hard ceiling. Several reward-based approaches improve domain competence while explicitly rewarding the reasoning, not just the final token. RLAG rewards both answer accuracy and the rationality of the explanation, cycling between augmented and unaugmented generation so the model progressively internalizes coherent knowledge structures, and it beats SFT precisely because it prioritizes reasoning quality over token-level correctness (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?"). RLSF uses the model's own answer-span confidence to rank reasoning traces, strengthening step-by-step inference while reversing the calibration damage that RLHF tends to cause, with no human labels or external verifiers needed (see "Can model confidence work as a reward signal for reasoning?"). And for smaller models, DPO trained on correct and incorrect function-calling examples from a large teacher outperforms SFT alone, because the explicit negative examples target the rigid output-format failures SFT leaves behind (see "Can small models match large models on function calling?"). The pattern: signals that reward the *process* preserve coupling; signals that reward only the final answer erode it.
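To make the role of negative examples concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. The log-probability values and `beta = 0.1` are illustrative assumptions, not settings from the cited work.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the policy being
    trained and under a frozen reference model.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Illustrative numbers only: the policy has moved toward the chosen
# (correct) output and away from the rejected (incorrect) one.
print(dpo_loss(-4.0, -8.0, -5.0, -5.0))
```

The loss falls only when the policy raises the chosen output relative to the rejected one, measured against the reference model. The rejected example therefore contributes its own gradient, which a loss over correct examples alone cannot supply.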

There's a deeper reframing worth knowing. Multiple lines of evidence suggest training rarely *creates* reasoning; it elicits what is already latent. Five independent methods (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all unlock reasoning that base models already carry in their activations, implying post-training selects rather than builds (see "Do base models already contain hidden reasoning ability?"). RLVR's gains concentrate on a small minority of high-entropy 'forking' tokens, the genuine decision points, and training on just those ~20% of tokens matches or exceeds full-gradient updates (see "Do high-entropy tokens drive reasoning model improvements?"). This is encouraging for coupling: if good training is mostly steering existing reasoning toward the right forks, the reasoning machinery doesn't have to be flattened to gain accuracy. But there's a hard limit on the 'domain accuracy' half: prompting and prompt optimization can only reorganize knowledge already in the model; they cannot inject missing domain facts (see "Can prompt optimization teach models knowledge they lack?"). So genuine new-domain accuracy does require training, which is exactly where the coupling risk lives.
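The idea of restricting updates to high-entropy 'forking' tokens can be sketched as an entropy-based mask over token positions. This is a simplified illustration under stated assumptions: the function names are hypothetical, the top 20% is selected per sequence, and the probability vectors are made up.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_token_mask(
    distributions: List[List[float]], top_fraction: float = 0.2
) -> List[bool]:
    """Mark the top_fraction highest-entropy positions in a sequence."""
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(top_fraction * len(entropies)))
    ranked = sorted(
        range(len(entropies)), key=lambda i: entropies[i], reverse=True
    )
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]


def masked_mean_loss(per_token_loss: List[float], mask: List[bool]) -> float:
    """Average the loss over masked positions only; the rest are ignored."""
    selected = [loss for loss, m in zip(per_token_loss, mask) if m]
    return sum(selected) / len(selected)


# Made-up distributions over a 3-token vocabulary at 5 positions.
dists = [
    [0.98, 0.01, 0.01],
    [0.40, 0.30, 0.30],  # the only genuinely uncertain position
    [0.90, 0.05, 0.05],
    [1.00, 0.00, 0.00],
    [0.97, 0.02, 0.01],
]
mask = forking_token_mask(dists)
print(mask)
print(masked_mean_loss([0.1, 2.0, 0.3, 0.0, 0.2], mask))
```

Only the uncertain position contributes to the training signal here. The near-deterministic tokens, which mostly carry fluent continuation, are left untouched.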

The perhaps unexpected takeaway: yes, you *can* keep reasoning coupled while gaining domain accuracy, but only if you stop optimizing for the final answer alone. The corpus reframes 'reasoning faithfulness' from a fuzzy interpretability concern into a measurable training-objective problem, and even hints at architectural escapes: diffusion LLMs that refine reasoning and answer in place, so answer confidence converges early while reasoning keeps refining, which lets early-exit mechanisms cut compute by about 50% while maintaining accuracy (see "Can reasoning and answers be generated separately in language models?").
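The early-exit idea mentioned above can be illustrated with a generic stopping rule. This is a sketch under loud assumptions, not the cited method: the `step_fn` interface, the `threshold` and `patience` parameters, and the confidence schedule are all hypothetical.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interface: refinement step number -> (current answer,
# confidence in that answer, as a value in [0, 1]).
StepFn = Callable[[int], Tuple[str, float]]


def refine_with_early_exit(
    step_fn: StepFn,
    max_steps: int,
    threshold: float = 0.9,
    patience: int = 2,
) -> Tuple[Optional[str], int]:
    """Stop once the same answer stays above threshold for `patience` steps.

    Returns the answer and the number of refinement steps actually used.
    """
    stable = 0
    last_answer: Optional[str] = None
    for step in range(1, max_steps + 1):
        answer, confidence = step_fn(step)
        if confidence >= threshold and answer == last_answer:
            stable += 1  # same confident answer as the previous step
        elif confidence >= threshold:
            stable = 1   # confident, but the answer just changed
        else:
            stable = 0   # not confident yet
        last_answer = answer
        if stable >= patience:
            return answer, step
    return last_answer, max_steps


# Made-up schedule: the answer settles well before the step budget ends.
schedule = [("7", 0.40), ("12", 0.60), ("14", 0.92), ("14", 0.95),
            ("14", 0.97), ("14", 0.98), ("14", 0.98), ("14", 0.99)]
print(refine_with_early_exit(lambda s: schedule[s - 1], max_steps=8))
```

With this made-up schedule the loop stops at step 4 of 8. How much compute a real system saves depends on how early its answer confidence actually converges.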

Sources 10 notes

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Can reasoning and answers be generated separately in language models?

ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.
