What makes a first answer so often the best answer a model produces?
This note explores why a model's initial answer is so often its best one. The corpus suggests two very different mechanisms are tangled together here: models genuinely reach the answer early, and models are also biased to prefer whatever they said first, whether or not it is right. The two look identical from the outside but mean opposite things. One is flattering to the model; the other is a warning.
The flattering version: models really do know the answer early. Diffusion language models often reach the correct answer about halfway through decoding: up to 99% of MMLU and 97% of GSM8K instances are already correct by the midpoint of refinement, which is why decoding can stop early with no quality loss (see "Can diffusion models commit to answers before full decoding?"). The extra steps aren't discovering the answer; they're elaborating one already chosen. The same shape shows up in chain-of-thought: accuracy follows an inverted U against chain length, and more capable models prefer *shorter* chains, with reinforcement learning naturally drifting toward brevity as models improve (see "Why does chain of thought accuracy eventually decline with length?"). For simple questions, letting the question flow straight to an answer beats forcing step-by-step reasoning at all (see "Why do some questions perform better without step-by-step reasoning?"). In all these cases, the first answer wins because the thinking after it is decoration, not discovery. Sometimes the decoration actively hurts, as when reasoning models keep grinding away on ill-posed questions that a non-reasoning model would simply flag as unanswerable (see "Why do reasoning models overthink ill-posed questions?").
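The early-commit idea can be made concrete with a small sketch. This is not Prophet's actual algorithm, only a toy illustration of stopping refinement once the gap between the top two candidate answers is large enough. The function names, the `gap_threshold` value, and the representation of each refinement step as a plain list of probabilities are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def confidence_gap(probs: Sequence[float]) -> float:
    """Gap between the most likely and second most likely candidate."""
    top_two = sorted(probs, reverse=True)[:2]
    if len(top_two) < 2:
        return top_two[0] if top_two else 0.0
    return top_two[0] - top_two[1]


def decode_with_early_commit(
    step_probs: List[Sequence[float]],
    gap_threshold: float = 0.5,  # hypothetical value, not taken from any paper
) -> Tuple[int, int]:
    """Return (answer_index, steps_used).

    step_probs[t] is the distribution over candidate answers after
    refinement step t + 1. Decoding stops at the first step whose top-two
    gap reaches the threshold; otherwise every step is used.
    """
    if not step_probs:
        raise ValueError("need at least one refinement step")
    for steps_used, probs in enumerate(step_probs, start=1):
        if confidence_gap(probs) >= gap_threshold:
            break
    # Commit to the most likely candidate at the step where we stopped.
    answer = max(range(len(probs)), key=lambda i: probs[i])
    return answer, steps_used
```

If the answer committed at the early stop matches the answer after full refinement, the remaining steps were elaboration rather than discovery, which is the pattern the diffusion results describe.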
The warning version is darker: the first answer wins not because it's best but because the model is biased to trust it. Models systematically over-validate text they generated themselves, because a high-probability answer *feels* more correct when the same model grades it. This self-agreement loop breaks only when the answer is compared against a broader set of alternatives (see "Why do models trust their own generated answers?"). That's the mechanism behind a lot of failed self-correction: the model isn't re-examining, it's re-confirming. Confidence amplifies this: highly confident models resist prompt rephrasing and revision, which is great when they're right and a trap when they're wrong (see "Does model confidence predict robustness to prompt changes?").
What makes the two versions hard to tell apart is that standard metrics can't see the difference. Supervised fine-tuning raises final-answer accuracy while cutting Information Gain, the measure of inferential-step quality used in that work, by 38.9%. The model arrives at correct answers through post-hoc rationalization rather than genuine reasoning, and benchmarks that score only the final answer never notice (see "Does supervised fine-tuning improve reasoning or just answers?"). So a 'good first answer' can come from a model that genuinely solved the problem fast, or from one that committed early and built a justification backward. Both look like a confident first answer that's hard to improve on.
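A toy comparison shows why final-answer scoring cannot separate the two cases. The step-level measure below is a simplified stand-in, not the Information Gain metric from the cited work: it assumes some external evaluator supplies the probability of the correct answer after each reasoning step, and it averages the change in log-probability per step. Both function names are hypothetical.

```python
import math
from typing import Sequence


def final_answer_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of final answers that match the gold labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def mean_step_gain(p_correct: Sequence[float]) -> float:
    """Average per-step change in log-probability of the correct answer.

    p_correct[0] is the probability before any reasoning step and
    p_correct[t] the probability after step t (assumed to come from an
    external evaluator; every value must be > 0).
    """
    if len(p_correct) < 2:
        raise ValueError("need a starting probability and at least one step")
    gains = [math.log(b) - math.log(a) for a, b in zip(p_correct, p_correct[1:])]
    return sum(gains) / len(gains)


# Two traces that both end on the correct answer.
genuine = [0.25, 0.50, 0.90]      # each step moves toward the answer
rationalized = [0.90, 0.90, 0.90]  # answer fixed up front, steps add nothing
```

Under these assumptions both traces score identically on final-answer accuracy, while `mean_step_gain` is positive for `genuine` and zero for `rationalized`. A benchmark that reports only the first number treats them as the same model.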
The quietly useful takeaway: the fix isn't 'think more', since overthinking has its own failure modes. It's knowing *when* the first answer is trustworthy. That's why some systems now train a single model to choose between answering immediately and engaging extended reasoning, instead of always doing one or the other (see "Can models learn when to think versus respond quickly?"). The best answer being the first answer isn't a property of the model so much as a property of the question, and the open problem is teaching models to tell which kind of question they're holding.
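The think-versus-answer decision can be sketched as a simple dispatch. Systems like the one cited learn this choice during training; the version below swaps in a hand-written confidence threshold purely for illustration. The names `route`, `Routed`, `quick_answer`, `slow_answer`, and `min_confidence` are hypothetical, and the two callables stand in for whatever fast and extended-reasoning paths a real system exposes.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Routed:
    mode: str    # "direct" or "think"
    answer: str


def route(
    question: str,
    quick_answer: Callable[[str], Tuple[str, float]],  # returns (answer, confidence)
    slow_answer: Callable[[str], str],
    min_confidence: float = 0.8,  # illustrative threshold, not a tuned value
) -> Routed:
    """Keep the first answer when it is confident enough, else reason at length."""
    answer, confidence = quick_answer(question)
    if confidence >= min_confidence:
        return Routed("direct", answer)
    return Routed("think", slow_answer(question))
```

The weakness of this stand-in is exactly the warning above: if the confidence score comes from the same model that produced the answer, a self-agreeing model will route its own mistakes to the direct path. That is one reason to prefer a learned or externally calibrated signal over a raw self-reported one.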
Sources (8 notes)
Up to 99% of MMLU instances and 97% of GSM8K instances reach correct answers by the midpoint of refinement. Prophet exploits this by monitoring confidence gaps to stop early, achieving 3.4× speedup with no quality loss.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Saliency analysis reveals that CoT prompting fails when question information doesn't aggregate into the prompt structure before reasoning begins. For simple questions, direct question-to-answer flow outperforms step-by-step reasoning, showing the optimal prompt depends on question type, not just task category.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.