Why are receiver attention heads narrower in reasoning models than in base models?

This explores why the attention heads that 'receive' or aggregate information from across the context appear to concentrate on fewer, more selective signals in models trained for reasoning compared to their base counterparts — and the corpus doesn't name 'receiver heads' directly, so I'm reading it through what we know about attention-head specialization, sparsification under load, and how reasoning training reshapes (rather than rebuilds) base-model machinery.

No note in the collection uses the exact phrase 'receiver attention heads,' so the honest move is to triangulate from three nearby lines of evidence rather than pretend a paper answers the question head-on. What the corpus does say, repeatedly, is that the heads doing the heavy lifting are already a tiny minority, and that narrowing under load looks like a feature, not a defect.

Start with the base-model picture. Fewer than 5% of attention heads function as 'retrieval heads': sparse, universal mechanisms that pull specific facts out of long context. Pruning them causes hallucination even when the answer is sitting right there in the context (see 'What mechanism enables models to retrieve from long context?'). So the aggregation work is already concentrated in a narrow set of heads before any reasoning training happens. That matters because of a second finding: post-training doesn't create reasoning, it selects it. Five independent methods all elicit reasoning that already lives in base-model activations, which reframes reasoning training as elicitation, that is, sharpening and routing existing circuitry rather than growing new heads (see 'Do base models already contain hidden reasoning ability?'). A 'narrower' receiver head in a reasoning model is consistent with this: training plausibly prunes the diffuse, exploratory attention of the base model down to the channels that actually carry the answer.
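One way to make 'narrower' concrete is to score how concentrated a head's attention distribution is. The sketch below is only an illustration, not a metric taken from any of the notes: it assumes narrowness is measured by Shannon entropy, and the function names and the two example distributions are hypothetical.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one head's attention over context positions.

    `weights` are non-negative attention scores; they are renormalized here,
    so they do not need to sum to exactly 1.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

def effective_span(weights):
    """exp(entropy): roughly how many positions the head is really attending to."""
    return math.exp(attention_entropy(weights))

# Hypothetical distributions over 8 context positions (made-up values).
diffuse = [0.125] * 8                                          # attends everywhere equally
narrow = [0.90, 0.04, 0.02, 0.01, 0.01, 0.01, 0.005, 0.005]    # locks onto one position

print(effective_span(diffuse), effective_span(narrow))
```

A uniform distribution over 8 positions has an effective span of 8, and any more concentrated distribution scores lower. Under this assumed metric, comparing the same head in a base model and its reasoning-tuned counterpart would show whether it has actually narrowed.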

The most suggestive parallel is the discovery that hidden states sparsify under difficulty. As tasks get harder or more out-of-distribution, an LLM's activations become substantially sparser in a localized, systematic way, and the authors read this as an adaptive selective filter that stabilizes performance, not a breakdown (see 'Do language models sparsify their activations under difficult tasks?'). Narrower attention is plausibly the same story told at the head level: when the model commits to a reasoning trajectory, it narrows what it listens to. Reasoning models, which are trained to stay on a difficult chain rather than hedge, may simply spend more time in this sparsified, narrow-attention regime by default.
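To see what 'sparser hidden states' could mean numerically, here is a minimal sketch of two common sparsity scores. The note does not say which measure the authors used, so both choices here (a near-zero fraction with an arbitrary threshold `eps`, and the Hoyer sparsity measure) are assumptions made for illustration.

```python
import math

def near_zero_fraction(activations, eps=1e-3):
    """Fraction of activations whose magnitude is below `eps` (threshold is an assumption)."""
    return sum(1 for a in activations if abs(a) < eps) / len(activations)

def hoyer_sparsity(activations):
    """Hoyer sparsity: 0 when all entries have equal magnitude, 1 when only one is non-zero."""
    n = len(activations)
    l1 = sum(abs(a) for a in activations)
    l2 = math.sqrt(sum(a * a for a in activations))
    if n < 2 or l2 == 0:
        # Undefined for an all-zero or single-element vector; returning 0.0 is a convention.
        return 0.0
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

# Hypothetical hidden-state slices (made-up values, not measured data).
dense = [1.0, 1.0, 1.0, 1.0]
sparse = [0.0, 0.0, 0.0, 5.0]

print(hoyer_sparsity(dense), hoyer_sparsity(sparse))
print(near_zero_fraction(dense), near_zero_fraction(sparse))
```

Tracking a score like this across prompts of increasing difficulty is one way the sparsification claim could be checked on a given model.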

There's a cross-domain parallel worth knowing. In multimodal models, long rationales hurt fine-grained perception tasks, because the real bottleneck is visual attention allocation, not how much the model verbalizes (see 'Does verbose chain-of-thought actually help multimodal perception tasks?'). The lesson plausibly generalizes: reasoning gains come from where attention is pointed, not from spreading it wider. And logit lens analysis shows that models trained with hidden chain-of-thought tokens compute the correct answer in layers 1-3, then suppress it in the final layers to produce format-compliant filler. The genuinely load-bearing reading of context therefore happens in a narrow band and gets actively overwritten downstream (see 'Do transformers hide reasoning before producing filler tokens?').
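The logit lens idea mentioned above can be sketched in a few lines: project each layer's hidden state through the unembedding matrix and check where the answer token ranks. This is a toy illustration under stated assumptions: the matrix, the hidden states, and the helper name `logit_lens_rank` are all hypothetical, and the final layer normalization that a real logit lens applies is omitted.

```python
def logit_lens_rank(hidden, unembed, target):
    """Rank of token `target` (0 = top prediction) when `hidden` is read out directly.

    `unembed` holds one row of weights per vocabulary token.
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return sum(1 for value in logits if value > logits[target])

# Toy setup: 3 vocabulary tokens, hidden size 2 (made-up numbers).
unembed = [
    [1.0, 0.0],    # token 0: the 'answer' token
    [0.0, 1.0],    # token 1: a 'filler' token
    [-1.0, -1.0],  # token 2: an unrelated token
]

# One hidden state per layer, early to late.
layer_states = [
    [2.0, 0.5],
    [1.5, 1.0],
    [0.2, 3.0],
]

ranks = [logit_lens_rank(h, unembed, target=0) for h in layer_states]
print(ranks)
```

In this toy data the answer token is the top prediction in the first two layers and drops to second place in the last one, which mirrors the shape of the pattern the note describes: present early, suppressed late, and still recoverable from lower-ranked predictions.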

So the synthesis: narrower receiver heads in reasoning models likely aren't a side effect to be fixed — they're what 'commitment to a reasoning path' looks like in the attention pattern. The base model keeps its options open across many heads; the reasoning model has been selected, sharpened, and pushed into a sparse-and-focused regime, the same way activations sparsify under hard tasks. The thing you didn't know you wanted to know: the field increasingly treats narrowing — fewer active heads, sparser states, less verbalization — as the signature of competent reasoning rather than lost capacity, which inverts the intuition that more attention is better.

Sources (5 notes)

What mechanism enables models to retrieve from long context?

Less than 5% of attention heads across all model families function as retrieval heads, are intrinsic to short-context models, dynamically activate by context, and are causally necessary for factuality. Pruning them causes hallucination despite information being present in context.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Does verbose chain-of-thought actually help multimodal perception tasks?

Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
