What hidden computations happen inside transformer layers during reasoning?
This note explores what is actually computed inside transformer layers while a model reasons: the hidden machinery beneath the visible chain-of-thought text. The corpus's most striking finding is that the reasoning you see is often not the reasoning that happened. One study applying a "logit lens" to models trained with hidden chain-of-thought tokens found that they compute the correct answer in their earliest layers (1–3), then actively suppress those representations in later layers to emit format-compliant filler tokens. The real reasoning remains recoverable from lower-ranked token predictions, hiding in plain sight [Do transformers hide reasoning before producing filler tokens?]. This reframes the research question: reasoning is better studied as a trajectory through hidden states than as the surface text it produces, with chain-of-thought acting only as a partial, sometimes misleading interface [Where does LLM reasoning actually happen during generation?].
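To make the logit-lens idea concrete, here is a minimal sketch in pure Python. Everything in it is hypothetical: the three-token vocabulary, the hand-written per-layer hidden states, and the unembedding matrix are invented for illustration and are not taken from the study. Real logit-lens implementations usually also apply the model's final normalization before unembedding, which this toy omits. The sketch only shows the mechanics: decode every layer's hidden state with the same unembedding matrix and track where the answer token ranks.

```python
import math

# Hypothetical toy setup: 3-token vocabulary, 2-dimensional hidden states.
VOCAB = ["42", "...", "the"]   # "42" plays the answer, "..." the filler token
UNEMBED = [                     # one row of unembedding weights per token
    [1.0, 0.0],     # "42"
    [0.0, 1.0],     # "..."
    [-1.0, -1.0],   # "the"
]

# Made-up hidden states at a single position, one per layer (early to late).
HIDDEN_BY_LAYER = [
    [2.0, 0.1],   # layer 1: the answer direction dominates
    [1.5, 1.0],   # layer 2
    [1.0, 2.5],   # layer 3: the filler direction takes over
]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden):
    """Decode an intermediate hidden state with the final unembedding matrix."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in UNEMBED]
    return softmax(logits)

def rank_of(token, probs):
    """1-based rank of `token` when tokens are sorted by probability."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order.index(VOCAB.index(token)) + 1

def answer_rank_per_layer(token="42"):
    """Rank of the answer token at every layer of the toy model."""
    return [rank_of(token, logit_lens(h)) for h in HIDDEN_BY_LAYER]

if __name__ == "__main__":
    for layer, h in enumerate(HIDDEN_BY_LAYER, start=1):
        probs = logit_lens(h)
        top = VOCAB[probs.index(max(probs))]
        print(f"layer {layer}: top={top!r}, rank of '42' = {rank_of('42', probs)}")
```

In this toy, the answer token is ranked first at layers 1 and 2 and drops to second at layer 3, where the filler token wins. That is the shape of the pattern described above: the answer is no longer the top prediction, but it is still sitting in the lower ranks.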
If the visible trace is just an interface, the natural next question is whether you need it at all, and several architectures say no. Depth-recurrent models, Heima, and Coconut scale test-time compute by iterating on hidden states rather than generating words, suggesting verbalization is a training habit, not a requirement for reasoning [Can models reason without generating visible thinking tokens?]. One 27M-parameter model solved Sudoku-Extreme and 30×30 mazes perfectly through pure latent computation while chain-of-thought methods scored zero [Can models reason without generating visible thinking steps?]. That's a doorway worth opening: the showy paragraph of "thinking" may be the least important part of what the model is doing.
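The idea of spending more compute by iterating on a hidden state, rather than by emitting more tokens, can be illustrated with a toy. The sketch below is not Heima, Coconut, or the 27M-parameter model; the graph, the `recurrent_block` function, and the 0/1 "hidden state" are all hypothetical stand-ins. It only shows the control flow: one shared block is applied repeatedly to an internal state, and a single answer is decoded at the end.

```python
# Hypothetical toy "maze": node i -> list of neighbours (a 5-node corridor).
EDGES = {0: [1], 1: [2], 2: [3], 3: [4], 4: []}

def recurrent_block(state):
    """One shared 'layer': mark every neighbour of an already-reached node.

    Reads from the old state and writes to a copy, so each call propagates
    reachability by exactly one step.
    """
    new_state = list(state)
    for node, reached in enumerate(state):
        if reached:
            for nxt in EDGES[node]:
                new_state[nxt] = 1
    return new_state

def latent_reason(start, goal, num_iterations):
    """Iterate the same block on a hidden state, then decode one answer."""
    state = [0] * len(EDGES)
    state[start] = 1
    for _ in range(num_iterations):   # extra test-time compute, no tokens emitted
        state = recurrent_block(state)
    return bool(state[goal])          # the only "output": is the goal reachable?

if __name__ == "__main__":
    for k in (2, 4):
        print(f"{k} latent iterations -> goal reached: {latent_reason(0, 4, k)}")
```

With two iterations the goal at node 4 is not yet reached; with four it is. Nothing is verbalized between iterations, which is the sense in which depth, not visible text, carries the computation in this toy.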
But look closer at what these hidden computations *are*, and the picture gets less flattering. Controlled training shows multi-hop reasoning emerges in three stages (memorization, in-distribution generalization, then cross-distribution reasoning) and that success correlates with a geometric signature: entity representations clustering by cosine similarity [How do transformers learn to reason across multiple steps?]. Yet other work argues that what looks like systematic reasoning is really linearized subgraph matching: transformers memorize computation patterns from training and stitch them together, failing badly on genuinely novel compositions as errors compound step by step [Do transformers actually learn systematic compositional reasoning?]. The reasoning trace itself can be stylistic mimicry: traces with invalid logical steps perform nearly as well as valid ones, meaning semantic correctness isn't what drives the gains [Do reasoning traces show how models actually think?].
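One simple way to quantify "entity representations clustering by cosine similarity" is to compare the average similarity within a group to the average similarity between groups. The sketch below is an assumption about how such a signature could be measured, not the metric used in the cited work; `clustering_gap`, the 2-dimensional vectors, and the group labels are all hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clustering_gap(vectors, labels):
    """Mean within-group cosine minus mean between-group cosine.

    A large positive gap means vectors sharing a label point in similar
    directions; a gap near zero or below means no such clustering.
    Assumes at least one same-label pair and one different-label pair.
    """
    within, between = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        (within if labels[i] == labels[j] else between).append(sim)
    return sum(within) / len(within) - sum(between) / len(between)

# Made-up "entity representations": two per hypothetical group.
VECTORS = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]

if __name__ == "__main__":
    print("grouped labels:", clustering_gap(VECTORS, ["A", "A", "B", "B"]))
    print("shuffled labels:", clustering_gap(VECTORS, ["A", "B", "A", "B"]))
```

With labels that match the geometry the gap is clearly positive, and with shuffled labels it turns negative. On real models the vectors would be hidden states extracted at entity positions, which this toy does not attempt.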
Two deeper framings make this stranger. One is that transformers don't *store* knowledge and look it up: knowledge exists as continuous flow through the residual stream, performed fresh each time, closer to oral recitation than to a database [Do transformer models store knowledge or generate it continuously?]. Another is architectural: attention aggregates all tokens through weighted parallel summation rather than selectively suppressing irrelevant ones, which is why models read additively and miss the frame-shifts behind jokes and wordplay. On this view the failure is a missing cognitive operation, not a knowledge gap [Why do AI systems miss jokes and wordplay so consistently?]. And the theoretical ceiling is high: there provably exists a single finite-size transformer that can compute any computable function given the right prompt, even if ordinary training rarely teaches models to behave that way [Can a single transformer become universally programmable through prompts?].
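The "weighted parallel summation" point can be seen directly in the arithmetic of standard scaled dot-product attention. The sketch below is a single-head, single-query version in pure Python; the query, key, and value vectors are made up. It illustrates only the additive property: for finite scores, softmax gives every token a strictly positive weight, so a low-scoring token is down-weighted but never removed from the sum. (Explicit masking, as in causal attention, is a separate mechanism and is not modeled here.)

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the attention weights and the output, which is the
    weighted sum of all value vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Hypothetical toy inputs: three tokens, from well matched to badly matched.
QUERY = [1.0, 0.0]
KEYS = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
VALUES = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

if __name__ == "__main__":
    weights, output = attend(QUERY, KEYS, VALUES)
    print("weights:", [round(w, 4) for w in weights])
    print("output: ", [round(x, 4) for x in output])
```

Even the worst-matching token keeps a small positive weight and contributes to the output, which is the structural sense in which attention reads additively.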
The thread tying these together (and the thing you might not have known you wanted to know) is that identical outputs can hide radically different internal mechanisms, and pushing one metric like accuracy often quietly degrades another like faithfulness [What actually happens inside a language model?]. There's even a privacy cost to this hidden machinery: reasoning chains materialize sensitive user data as cognitive scaffolding, and 74.8% of leaks come from the model simply recollecting it mid-thought [Do reasoning traces actually expose private user data?]. So "what happens inside the layers" isn't one answer. It's a gap between the performance and the computation, and most of the interesting research lives in that gap.
Sources (12 notes)
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Depth-recurrent and compressed-token architectures solve reasoning tasks through hidden computation rather than output tokens. A 27M-parameter model solved Sudoku-Extreme and 30×30 mazes perfectly while CoT methods scored zero.
Controlled training reveals transformers learn multi-hop reasoning in three phases: memorization, in-distribution generalization, and cross-distribution reasoning. Successful reasoning correlates with cosine clustering of entity representations, and second-hop generalization requires explicit compositional exposure during training.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Traces with invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.
Transformers integrate token information through weighted parallel aggregation rather than selective suppression of irrelevant words. This structural difference explains consistent failures with jokes, wordplay, and frame-dependent meaning—not knowledge gaps, but missing cognitive operations.
Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.
Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.
74.8% of privacy leaks in language model reasoning traces result from models materializing sensitive user data during thought processes. Longer reasoning chains amplify leakage, and anonymizing traces post-hoc degrades model utility, suggesting private data functions as cognitive scaffolding.