LLM Reasoning and Architecture

Do transformers hide reasoning before producing filler tokens?

Explores whether language models compute correct answers in early layers but then deliberately overwrite them with filler tokens in later layers, suggesting reasoning and output formatting are separable processes.

Note · 2026-02-23 · sourced from Cognitive Models Latent

When transformers are trained to solve reasoning tasks with filler (hidden) characters replacing explicit CoT tokens, a striking pattern emerges through logit lens analysis:

Layers 1-3: Correct numerical tokens from the reasoning computation appear as top predictions. The model is performing the actual computation in these early layers.

Layer 3 transition: Filler tokens begin appearing among top-ranked predictions, competing with the computational results.

Final layer: Filler tokens dominate the top predictions; the correct computational tokens are relegated to rank 2 or lower. The model has overwritten its intermediate reasoning representations with format-compliant output tokens.

The hidden computations are fully recoverable by examining lower-ranked tokens during decoding. The model performs the reasoning, stores the results in its representations, then actively overwrites them to produce the expected output format. The mechanism likely involves induction heads, pattern-copying circuits that learn this overwriting behavior from the training distribution.
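The logit-lens procedure behind these observations can be sketched with a toy example. Everything below is illustrative, not from the paper: the dimensions, token ids, and hand-built per-layer hidden states are assumptions chosen to reproduce the reported rank pattern. The core move is real, though: project each layer's hidden state through the unembedding matrix and read off where the answer and filler tokens rank.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab = 64, 50
ANSWER, FILLER = 7, 42  # hypothetical token ids

# Unembedding matrix W_U maps a hidden state to vocabulary logits.
W_U = rng.normal(size=(d_model, vocab))

def logit_lens_rank(hidden, token_id):
    """Rank of token_id (0 = top prediction) when a layer's hidden
    state is projected through the unembedding matrix."""
    logits = hidden @ W_U
    order = np.argsort(-logits)
    return int(np.where(order == token_id)[0][0])

# Toy per-layer hidden states mimicking the reported pattern:
# early layers align with the answer token's unembedding direction,
# the final layer mostly aligns with the filler token's, while a
# residual answer component keeps the computation recoverable.
answer_dir = W_U[:, ANSWER] / np.linalg.norm(W_U[:, ANSWER])
filler_dir = W_U[:, FILLER] / np.linalg.norm(W_U[:, FILLER])
layers = {
    "layer 1": 3.0 * answer_dir,                     # computation visible
    "layer 3": 2.0 * answer_dir + 1.0 * filler_dir,  # filler starts competing
    "final":   1.0 * answer_dir + 3.0 * filler_dir,  # filler wins the top rank
}

for name, h in layers.items():
    print(name,
          "answer rank:", logit_lens_rank(h, ANSWER),
          "filler rank:", logit_lens_rank(h, FILLER))
```

At the final layer the answer token loses rank 0 to the filler token but remains recoverable among the lower-ranked candidates, which is exactly the decoding trick the note describes.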

This finding has two important implications. First, it provides mechanistic evidence relevant to Why does reasoning training help math but hurt medical tasks?, with a twist: the computation happens in the early layers, and the overwriting happens in the later layers. The functional separation is computation-in-early-layers, formatting-in-late-layers, not simply knowledge-down/reasoning-up.

Second, it demonstrates a distinction between instance-adaptive and parallelizable computation. Instance-adaptive CoT requires caching subproblem solutions in the token outputs themselves: later tokens depend on earlier results. This dependency structure is incompatible with parallel filler-token computation. Hidden computation in filler tokens works for tasks where the full solution can be computed in a single forward pass, but not for problems requiring sequential dependencies between reasoning steps.
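The dependency distinction can be made concrete with a minimal sketch (plain Python, purely illustrative of the dependency structure, not a transformer; both task functions are invented for the example). A parallelizable task evaluates every position independently, while an instance-adaptive task must thread each step's result into the next, which is the role explicit CoT tokens play and filler tokens cannot.

```python
# Parallelizable: each position depends only on the input, so all
# positions could be computed simultaneously in one forward pass.
def parallel_task(xs):
    return [x * x for x in xs]

# Instance-adaptive: step t consumes the result of step t-1, so the
# intermediate results must be cached somewhere -- with explicit CoT,
# in the emitted tokens themselves.
def sequential_task(x0, steps):
    x, trace = x0, []
    for _ in range(steps):
        x = (3 * x + 1) % 17  # hypothetical step rule; needs previous x
        trace.append(x)
    return trace

print(parallel_task([1, 2, 3]))    # -> [1, 4, 9], positions independent
print(sequential_task(2, 4))       # -> [7, 5, 16, 15], each entry needs the one before
```

Replacing the `trace` list with uninformative filler values would destroy `sequential_task` but leave `parallel_task` untouched, mirroring why hidden filler-token computation handles only the parallelizable case.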

This connects to the CoT faithfulness literature: if models can compute correct answers without explicit reasoning tokens, the explicit CoT chain is not necessarily the mechanism producing the answer. The overwriting pattern suggests the model has two separable processes — computation and expression — that may not align. See Do language models actually use their reasoning steps?.

