Do transformers hide reasoning before producing filler tokens?
Explores whether language models compute correct answers in early layers but then deliberately overwrite them with filler tokens in later layers, suggesting reasoning and output formatting are separable processes.
When transformers are trained to solve reasoning tasks with filler (hidden) characters replacing explicit CoT tokens, a striking pattern emerges through logit lens analysis:
- Layers 1-3: correct numerical tokens from the reasoning computation appear as top predictions. The model performs the actual computation in these early layers.
- Layer 3 transition: filler tokens begin appearing among the top-ranked predictions, competing with the computational results.
- Final layer: filler tokens dominate the top predictions; the correct computational tokens are relegated to rank 2 or lower. The model has overwritten its intermediate reasoning representations with format-compliant output tokens.
The hidden computations are fully recoverable by examining lower-ranked tokens during decoding. The model performs the reasoning, stores the results in its representations, then actively overwrites them to produce the expected output format. The mechanism likely involves induction heads — pattern-copying circuits that learn to overwrite based on training distribution patterns.
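To make the logit-lens readout concrete, here is a minimal sketch using an off-the-shelf GPT-2 via the Hugging Face transformers library. The prompt is an illustrative stand-in: reproducing the actual finding would require substituting the model fine-tuned on filler-token CoT data.

```python
# Minimal logit-lens sketch: project each layer's hidden state through the
# final layer norm and the unembedding matrix, then inspect the top-ranked
# vocabulary items per layer. (Naive logit lens; GPT-2's last hidden state
# already includes ln_f, so the final row only approximates the true output.)
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "12 + 29 ="  # stand-in; the paper's setup uses filler-token prompts
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states[0] is the embedding output; entries 1..n are the layers.
for layer, hidden in enumerate(out.hidden_states):
    h = model.transformer.ln_f(hidden[0, -1])   # final layer norm
    logits = model.lm_head(h)                   # unembedding
    top = torch.topk(logits, k=5).indices
    print(f"layer {layer:2d}:", [tokenizer.decode(int(t)) for t in top])
```

In the filler-token setting, this per-layer readout is where the pattern above appears: correct numerical tokens dominate the early rows, filler tokens take over by the final layer, and the computational tokens survive at rank 2 or lower.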
This finding has two important implications. First, it provides mechanistic evidence for Why does reasoning training help math but hurt medical tasks?, with a twist: the computation happens in early layers and the overwriting happens in late layers. The functional separation is computation-in-early-layers versus formatting-in-late-layers, not simply knowledge-down/reasoning-up.
Second, it demonstrates a distinction between instance-adaptive and parallelizable computation. Instance-adaptive CoT requires caching subproblem solutions in the emitted tokens: later tokens depend on earlier results. That dependency structure is incompatible with parallel filler-token computation. Hidden computation in filler tokens works for tasks where the full solution can be computed in a single forward pass, but not for problems whose reasoning steps depend sequentially on one another, as the toy sketch below illustrates.
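A toy sketch (hypothetical, not from the source) makes the dependency argument concrete: a sequential solver threads each step's result into the next, while a single parallel pass sees only the original input.

```python
# Instance-adaptive CoT: each step consumes the previous step's cached
# result, so the steps cannot be evaluated in parallel.
def solve_sequential(x, steps):
    result, trace = x, []
    for step in steps:
        result = step(result)   # depends on the value produced one step earlier
        trace.append(result)
    return result, trace

# Filler-token computation: one forward pass, no access to earlier emitted
# results. Each "step" sees only the original input x.
def solve_parallel(x, steps):
    return [step(x) for step in steps]

steps = [lambda v: v + 3, lambda v: v * 2]
print(solve_sequential(5, steps))   # (16, [8, 16]) -- the chain composes
print(solve_parallel(5, steps))     # [8, 10]       -- the composition is lost
```

The parallel version recovers the sequential answer only when every step can be re-expressed as a function of the original input, which is exactly the class of problems where filler tokens suffice.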
This connects to the CoT faithfulness literature: if models can compute correct answers without explicit reasoning tokens, the explicit CoT chain is not necessarily the mechanism producing the answer. The overwriting pattern suggests the model has two separable processes, computation and expression, that need not align. See Do language models actually use their reasoning steps?
Source: Cognitive Models Latent
Related concepts in this collection
- Why does reasoning training help math but hurt medical tasks?
  Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
  refines: early layers compute, late layers format; the separation is functional, not just knowledge vs reasoning
- Do language models actually use their encoded knowledge?
  Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
  the overwriting mechanism explains HOW encoded information fails to influence generation: later layers actively suppress it
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  hidden computation explains why CoT can be unfaithful: the model may use a different internal computation path
- Does chain of thought reasoning actually explain model decisions?
  When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.
  the computation-expression separation extends to agentic pipelines
- What mechanism enables models to retrieve from long context?
  Do attention heads specialize in retrieving relevant information from long context windows, and if so, what makes them universal across models and necessary for factual generation?
  retrieval heads complement hidden filler reasoning: early-layer hidden computations produce intermediate results that retrieval heads access during generation. The overwrite pattern also explains why specialized retrieval heads are necessary: once intermediate representations are overwritten in place, the model must retrieve them from earlier token positions via these sparse attention heads.
Original note title
transformers perform hidden reasoning computations in earlier layers then overwrite intermediate representations with filler tokens in later layers