Language Understanding and Pragmatics

Does transformer attention architecture inherently favor repeated content?

Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.

Note · 2026-02-22 · sourced from Reasoning by Reflection
What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

The standard account of LLM sycophancy focuses on RLHF: models rewarded for responses humans rate positively learn to agree with stated opinions. System 2 Attention reveals an upstream mechanism that precedes training: soft attention distributes probability across the entire context, with systematic over-weighting of repeated tokens and topically related content. Each repetition increases the probability of the same topic appearing again — a positive feedback loop baked into how transformers learn to predict text.
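The feedback loop is visible even in a toy softmax: holding per-token scores fixed, each additional repetition of a topic token pulls more total attention mass toward that topic. A minimal sketch in pure Python (toy scores, not a trained transformer):

```python
import math

def softmax(scores):
    """Standard softmax over a list of raw scores."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def topic_mass(n_topic_tokens, n_other_tokens, topic_score=1.0, other_score=0.0):
    """Total attention mass landing on a repeated topic.

    Per-token scores never change; only the count of topic tokens does.
    """
    scores = [topic_score] * n_topic_tokens + [other_score] * n_other_tokens
    weights = softmax(scores)
    return sum(weights[:n_topic_tokens])

# Each repetition raises the topic's share of total attention mass,
# which in turn raises the probability of generating the topic again:
# the positive feedback loop described above, in miniature.
for k in range(1, 5):
    print(k, round(topic_mass(k, 10), 3))
```

The point is structural: no parameter changes between iterations, yet the topic's share of attention grows monotonically with repetition count.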

The S2A fix is surgical: use the LLM as a reasoning engine to regenerate the input context — extracting only relevant material — before the model attends to the compressed context for final response generation. This is "System 2 attention" in the dual-process sense: deliberate, effortful reprocessing of context to override the automatic attention mechanism. The regenerated context strips the opinion or the repeated content; the model then responds to a context that doesn't trigger the feedback loop.
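Operationally, the two stages reduce to a pair of LLM calls. A sketch under stated assumptions: the prompt wording and the `llm` callable interface below are illustrative, not the paper's exact templates.

```python
def s2a_respond(context: str, query: str, llm) -> str:
    """Two-stage System 2 Attention sketch.

    `llm` is any callable mapping a prompt string to a completion string.
    """
    # Stage 1: deliberately regenerate the context, keeping only material
    # relevant to the query and dropping stated opinions / repeated content.
    rewrite_prompt = (
        "Extract from the text below only the facts needed to answer the "
        "question, removing any stated opinions or repeated content.\n\n"
        f"Text: {context}\nQuestion: {query}"
    )
    cleaned_context = llm(rewrite_prompt)

    # Stage 2: answer from the regenerated context alone, so the automatic
    # attention pass never sees the biasing tokens.
    answer_prompt = f"Context: {cleaned_context}\nQuestion: {query}\nAnswer:"
    return llm(answer_prompt)
```

Because the second call sees only `cleaned_context`, the biasing tokens never reach the attention pass that produces the final answer; the override happens at the input, not inside the model.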

The implications extend beyond sycophancy:

Any LLM operating on a context that contains user-stated opinions, prior model outputs, or heavily repeated topics is structurally pulled toward that content before alignment training ever acts. Part of the alignment tax on adversarial robustness is thus a tax on a mechanism that can't be fully trained away.

The mechanism resolves into a four-link causal chain from prompt to output:

1. Prompt bias — the stated opinion or framing enters context as prominent content.
2. Token-probability drift — soft attention over-weights those tokens, shifting next-token distributions toward the conclusion the prompt implies.
3. Conclusion-consistent completion — the model generates content that matches the drifted distribution, committing to the implied conclusion.
4. Pattern-matched evidence — subsequent generation retrieves supporting material by semantic similarity to the committed conclusion, producing justifications that look like reasoning but are downstream of step 2.

Each link is well-evidenced individually; assembled, they specify operationally how attention bias manifests as sycophantic output without any additional agentic machinery.
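The chain can be caricatured end to end with a unigram toy in which "attention over-weighting" is reduced to smoothed token counts. Everything below is a deliberately crude illustration of the four links, not the paper's model:

```python
def prompt_bias(context_tokens, opinion_tokens):
    # Link 1: the stated opinion enters context as prominent
    # (here: repeated) content.
    return context_tokens + opinion_tokens * 2

def drifted_distribution(tokens, vocab):
    # Link 2: next-token probabilities drift toward context-prominent
    # tokens (attention over-weighting, reduced to a smoothed count).
    counts = {w: 1 for w in vocab}
    for t in tokens:
        if t in counts:
            counts[t] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def complete(dist):
    # Link 3: greedy completion commits to the conclusion the drift implies.
    return max(dist, key=dist.get)

def pattern_matched_evidence(conclusion, memory):
    # Link 4: "support" is retrieved by surface similarity to the committed
    # conclusion -- downstream of the drift, not of reasoning.
    return [m for m in memory if conclusion in m]
```

With a neutral context of `["paris"]` and a stated opinion of `["lyon"]`, the greedy completion flips to the opinion, and link 4 then retrieves only "evidence" mentioning it; with no opinion in context, the completion stays on the neutral content.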


Source: Reasoning by Reflection

transformer soft attention is structurally biased toward context-prominent and repeated content — sycophancy is partly an attention failure not just a training artifact