Why do language models use remaining tokens to rationalize instead of reconsider?
This explores why, once an LLM has started down a line of reasoning, it tends to spend the rest of its output defending that line rather than backing up and revising it — and what in the mechanics of token generation makes 'continue' so much cheaper than 'reconsider.'
The corpus points to an answer that is baked into how generation works rather than a failure of effort. The cleanest framing is that token prediction is a smooth probabilistic flow: a model is trained to continue toward its training distribution, not to explore the logically related counter-positions to whatever it just said (Does LLM generation explore competing claims while producing text?). Reconsidering would mean introducing turbulence, a sharp break with the text already on the page, and that is exactly the move the objective smooths away. So the remaining tokens flow toward coherence with the prefix, and coherence with a claim looks indistinguishable from rationalizing it.
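The pressure to continue rather than reconsider can be read off the standard autoregressive factorization. The notation below is introduced here for illustration (tokens $x_t$, prefix $x_{<t}$, parameters $\theta$); it is the usual next-token setup, not a formula taken from the cited notes.

```latex
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
```

At step $t$ the prefix appears only on the conditioning side. Sampling chooses $x_t$ given $x_{<t}$ and has no operation for revising tokens already emitted, so a reversal has to arrive as new tokens that are themselves probable given the very prefix they would be reversing.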
There's a deeper reason the prefix has such gravity. The 20-questions regeneration test shows that a model never really 'commits' to a position the way a person does: it holds a superposition and samples, and every continuation it produces is generated to stay consistent with the prior context (Do large language models actually commit to a single character?). Once a few tokens land, they become part of that prior context, and the most probable next tokens are the ones that cohere with them. Reconsidering requires the model to treat its own earlier output as wrong, but the machinery is built to treat that output as a constraint to satisfy. The same dynamic shows up when context loses to training priors: models generate outputs inconsistent with information right in front of them because strong parametric associations dominate, and prompting alone can't override them (Why do language models ignore information in their context?). Self-revision is a special case of that harder problem: getting the model to weight new evidence over an established lean.
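The regeneration test can be sketched with a toy in which the "model" never stores a chosen object and only samples one at reveal time, subject to the answers already given. This is a minimal sketch: the candidate table, the attribute names, and the helpers `consistent_with` and `reveal` are all invented for illustration and are not taken from the cited work.

```python
import random

# Hypothetical candidate objects with yes/no attributes (invented for illustration).
CANDIDATES = {
    "cat": {"alive": True, "bigger_than_breadbox": False},
    "oak": {"alive": True, "bigger_than_breadbox": True},
    "horse": {"alive": True, "bigger_than_breadbox": True},
    "spoon": {"alive": False, "bigger_than_breadbox": False},
}


def consistent_with(context):
    """Return, in sorted order, every candidate that agrees with all answers so far."""
    return sorted(
        name
        for name, attrs in CANDIDATES.items()
        if all(attrs[question] == answer for question, answer in context.items())
    )


def reveal(context, rng):
    """Pick a 'final answer' only at reveal time, constrained by the context."""
    return rng.choice(consistent_with(context))


# Answers already on the page act as the prefix.
context = {"alive": True, "bigger_than_breadbox": True}

# Regenerate the reveal with different seeds: the result may vary between runs,
# but every sample satisfies the prior answers.
reveals = {reveal(context, random.Random(seed)) for seed in range(20)}

print(consistent_with(context))  # ['horse', 'oak']
print(sorted(reveals))
```

The design point is that the earlier answers are never re-examined. They only filter what can be said next, which is the sense in which the prefix is a constraint to satisfy instead of a claim to revisit.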
What makes this feel like rationalization specifically is that the reasoning text isn't doing the work it appears to do. Reasoning traces function as persuasive appearances rather than reliable accounts of computation: invalid logical steps perform nearly as well as valid ones (Do reasoning traces show how models actually think?), and models trained on corrupted or systematically irrelevant traces do about as well as models trained on correct ones, which suggests the trace is computational scaffolding, not meaning (Do reasoning traces need to be semantically correct?). If the prose was never the seat of the reasoning, then post-hoc justification is the natural output: fluent text that supports the answer without ever having derived it. You can even watch the gap open up: models trained with hidden chain-of-thought tokens compute the correct answer in their first few layers and then suppress those representations in the final layers to emit format-compliant filler (Do transformers hide reasoning before producing filler tokens?).
The unsettling corollary is that what reads as careful reasoning may be a default dressed up. When constraints are removed from a task, twelve of fourteen models get *worse*, by up to 38.5 percentage points, which suggests they were exploiting a conservative bias (defaulting to the harder option) rather than evaluating anything (Are models actually reasoning about constraints or just defaulting conservatively?). That's rationalization in miniature: the output narrates a justification for a choice the model arrived at by a shortcut. And since only about 20% of tokens are genuine high-entropy decision points where the model could fork (Do high-entropy tokens drive reasoning model improvements?), most of the remaining tokens are low-stakes continuation by construction. There simply aren't many positions where 'reconsider' is even on the table.
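To make "high-entropy decision point" concrete, here is a minimal sketch that scores each position by the Shannon entropy of its next-token distribution and keeps the top fifth. The distributions are invented, the helper names are hypothetical, and the 0.2 default only mirrors the roughly 20% figure in the sources; the selection procedure used in the cited work may differ.

```python
import math


def entropy_bits(probs):
    """Shannon entropy, in bits, of one next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def forking_positions(distributions, fraction=0.2):
    """Return the indices of the top `fraction` of positions ranked by entropy."""
    ranked = sorted(
        range(len(distributions)),
        key=lambda i: entropy_bits(distributions[i]),
        reverse=True,
    )
    keep = max(1, round(fraction * len(distributions)))
    return sorted(ranked[:keep])


# Invented next-token distributions for five positions in a continuation.
steps = [
    [0.97, 0.02, 0.01],    # near-deterministic continuation
    [0.40, 0.35, 0.25],    # several live options: a possible fork
    [0.99, 0.005, 0.005],
    [0.90, 0.05, 0.05],
    [0.95, 0.04, 0.01],
]

print(forking_positions(steps))  # [1]
```

In this toy, four of the five positions are close to deterministic, so a reversal could only plausibly enter at position 1. Everywhere else the distribution has already settled on continuing.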
The less obvious lesson: the fix probably isn't asking the model to try harder to revise in words. If reasoning lives in hidden states rather than verbalized tokens (Can models reason without generating visible thinking tokens?), and if diffusion-style architectures can refine an answer and its justification *simultaneously* instead of locking in a left-to-right prefix (Can reasoning and answers be generated separately in language models?), then 'reconsidering' may require breaking the autoregressive commitment to the prefix itself. The remedy would be not better prompting but a generation process that isn't structurally obligated to agree with what it already said.
Sources (10 notes)
Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, which indicates that no fixed commitment exists.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.