Why does attention-based drift happen automatically during generation?
This note explores why transformer outputs tend to slide toward whatever's already prominent in the context: not as a bug someone introduced, but as a built-in consequence of how attention weighs tokens while the model generates. The short version: soft attention is structurally biased toward content that's repeated or already prominent, and because each new token is conditioned on the tokens before it, that bias compounds on itself the moment generation starts. The corpus frames this as a feedback loop, not a stylistic accident. Attention systematically over-weights repeated and context-prominent tokens regardless of whether they're actually relevant, which amplifies opinions, framing, and sycophancy before any alignment training even gets a chance to intervene (see: Does transformer attention architecture inherently favor repeated content?). The drift is automatic because the architecture is doing exactly what it was built to do; it just has no native brake.
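To make the bias toward repeated content concrete, here is a minimal sketch of softmax attention over a single query. The token labels and raw scores are invented for illustration (they are not taken from any cited note), and real models work with learned high-dimensional keys rather than hand-set scores. The point is only that a token appearing three times can collect more total attention than a single token with a higher per-position score.

```python
import math
from collections import defaultdict

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_mass_by_token(tokens, scores):
    """Sum the attention weight received by each distinct token."""
    weights = softmax(scores)
    mass = defaultdict(float)
    for tok, w in zip(tokens, weights):
        mass[tok] += w
    return dict(mass)

# Hypothetical context: one highly relevant token ("fact") and a less
# relevant token ("opinion") that happens to be repeated three times.
tokens = ["fact", "opinion", "filler", "opinion", "filler", "opinion"]
scores = [1.0, 0.5, 0.0, 0.5, 0.0, 0.5]

mass = attention_mass_by_token(tokens, scores)
for tok, m in sorted(mass.items(), key=lambda kv: -kv[1]):
    print(f"{tok:8s} {m:.3f}")
```

With these made-up numbers, "opinion" ends up with roughly 0.51 of the attention mass against roughly 0.28 for "fact", even though "fact" has the highest score at any single position. Repetition alone is enough to win the aggregate.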
The "during generation" part matters more than it first appears. A transformer doesn't store a finished thought and read it out. It transmits knowledge as a continuous flow of activations, generated fresh at each step rather than retrieved from a fixed archive (see: Do transformer models store knowledge or generate it continuously?). That means there's no stable reference copy to drift away from; the output *is* the process. And the process never pauses to reconsider. Token ordering is sequential but atemporal: probabilistic selection without any intervening moment of reflection or revision (see: Does AI text generation unfold through temporal reflection?). A human writer drifts and then notices and corrects; the model has no duration in which noticing could happen, so small pulls toward prominent content accumulate uninterrupted.
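The idea that small pulls accumulate uninterrupted can be sketched with a toy model. This is an assumption-heavy illustration, not a description of real transformer dynamics: the function name `expected_share`, the `sharpen` exponent, and the starting `context_len` are all hypothetical. The context holds two competing framings, each generated token is appended to the context, and the chance of generating the prominent framing is its current share raised to `sharpen` and renormalized. With `sharpen=1.0` nothing is over-weighted and the expected share stays put; with `sharpen > 1.0` the majority framing is over-weighted and its share ratchets upward.

```python
def expected_share(share, steps, sharpen=2.0, context_len=10.0):
    """Expected share of the prominent framing after each generated token.

    share:       initial fraction of the context taken by the prominent framing
    sharpen:     exponent > 1 over-weights the majority (hypothetical knob)
    context_len: initial context length in tokens (hypothetical)
    """
    count = share * context_len
    history = [share]
    for _ in range(steps):
        s = count / context_len
        # Probability that the next token reinforces the prominent framing.
        p = s**sharpen / (s**sharpen + (1.0 - s) ** sharpen)
        count += p          # expected increment, never revised afterwards
        context_len += 1.0  # the new token joins the context
        history.append(count / context_len)
    return history

biased = expected_share(0.6, steps=20, sharpen=2.0)
neutral = expected_share(0.6, steps=20, sharpen=1.0)
print(f"over-weighted: {biased[0]:.2f} -> {biased[-1]:.2f}")
print(f"proportional:  {neutral[0]:.2f} -> {neutral[-1]:.2f}")
```

The loop has no step where an earlier token is reconsidered, which is the toy counterpart of generation without reflection: whatever bias exists at one step is baked into the context the next step sees.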
The mechanism comes into focus when you look at where errors actually enter. In chain-of-thought reasoning, the dominant failure source is *local* memorization: predictions over-anchored on the immediately preceding tokens, accounting for up to 67% of reasoning errors and getting worse as complexity rises (see: Where do memorization errors arise in chain-of-thought reasoning?). That's drift in miniature: the nearest, most prominent context wins the next-token competition even when it shouldn't. Relatedly, transformers integrate token information by weighted parallel aggregation, a weighted sum over everything in the window, rather than by selectively suppressing what's irrelevant, which is why they miss jokes and frame-dependent meaning (see: Why do AI systems miss jokes and wordplay so consistently?). The same missing operation (selective suppression) is what would otherwise let a model resist being pulled by whatever's loudest in the window.
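The contrast between weighted aggregation and selective suppression can be shown on scalar values. This is a sketch under stated assumptions: the scores, values, and especially the `relevant` mask are invented, and the mask acts as an oracle that a real model does not have. Soft attention lets every position contribute in proportion to its weight, so loud distractors drag the output toward themselves; suppression removes them before aggregation.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def soft_aggregate(scores, values):
    """Standard soft attention: every position contributes to the output."""
    weights = softmax(scores)
    return sum(w * v for w, v in zip(weights, values))

def suppress_then_aggregate(scores, values, relevant):
    """Drop positions flagged irrelevant, then attend over what remains.

    `relevant` is a hypothetical oracle mask used only for illustration.
    """
    kept = [i for i, keep in enumerate(relevant) if keep]
    weights = softmax([scores[i] for i in kept])
    return sum(w * values[i] for w, i in zip(weights, kept))

# Two loud distractors (value -1.0) surround one relevant token (value +1.0).
scores = [2.0, 1.0, 2.0]
values = [-1.0, 1.0, -1.0]
relevant = [False, True, False]

soft = soft_aggregate(scores, values)
hard = suppress_then_aggregate(scores, values, relevant)
print(f"soft aggregation:      {soft:+.3f}")
print(f"selective suppression: {hard:+.3f}")
```

In this toy setup the soft output is negative, dominated by the distractors, while the suppressed output equals the relevant token's value. The hard part in practice is producing the mask, which is exactly the operation the paragraph above describes as missing.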
Here's the part you might not have known you wanted: the drift isn't inevitable, and the fixes target exactly the mechanism above. Because the bias lives in how context is attended to, you can interrupt it by rewriting the context itself: System 2 Attention regenerates the prompt to strip irrelevant material before the model attends to it, breaking the feedback loop at its source (see: Does transformer attention architecture inherently favor repeated content?). A different angle: fewer than 5% of attention heads actually do faithful long-context retrieval, and they're causally necessary for factuality. Prune them and the model hallucinates despite the right information sitting in context (see: What mechanism enables models to retrieve from long context?). So drift is partly a story about the *non*-retrieval heads dominating. And architecturally, separating short-term attention from a dedicated long-term memory that prioritizes surprising tokens is one bet on giving generation something more stable than prominence to lean on (see: Can neural memory modules scale language models beyond attention limits?). The throughline across all of these: attention drift is automatic because prominence, not relevance, is the default currency of generation, and every mitigation is really an attempt to change what the model is allowed to find prominent.
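The two-pass shape of System 2 Attention (regenerate the context, then answer from the regenerated context only) can be sketched as below. Everything named here is an assumption for illustration: `llm` stands for any text-in, text-out callable, `REWRITE_INSTRUCTION` is invented wording rather than the prompt from the original work, and `toy_llm` is a stand-in that drops lines containing "I think" so the example runs without a model.

```python
from typing import Callable

# Hypothetical instruction text; not the prompt used in the original work.
REWRITE_INSTRUCTION = (
    "Rewrite the following text, keeping only material that is relevant "
    "and unbiased for answering the question it contains. "
    "Do not answer the question.\n\nText:\n"
)

def system2_attention(prompt: str, llm: Callable[[str], str]) -> str:
    """Two-pass generation: clean the context, then answer from the clean copy."""
    cleaned = llm(REWRITE_INSTRUCTION + prompt)  # pass 1: regenerate context
    return llm(cleaned)                          # pass 2: original prompt is not shown

def toy_llm(text: str) -> str:
    """Stand-in for a real model, used only to make the sketch runnable."""
    if text.startswith(REWRITE_INSTRUCTION):
        body = text[len(REWRITE_INSTRUCTION):]
        kept = [line for line in body.splitlines() if "I think" not in line]
        return "\n".join(kept)
    return f"ANSWER BASED ON: {text}"

prompt = (
    "I think the answer is Paris, Texas.\n"
    "Which city hosts the Eiffel Tower?"
)
result = system2_attention(prompt, toy_llm)
print(result)
```

The design choice worth noticing is that the second call never sees the original prompt. The opinionated line cannot attract attention in the answering pass because it is no longer in the window at all.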
Sources (7 notes)
Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.
Token ordering in LLMs follows probabilistic selection without intervening reflection or revision. Human discourse gains meaning from temporal structure—time spent thinking changes what comes next—but AI text production lacks this duration-in-reflection despite appearing sequentially composed.
The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, based on preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Transformers integrate token information through weighted parallel aggregation rather than selective suppression of irrelevant words. This structural difference explains consistent failures with jokes, wordplay, and frame-dependent meaning—not knowledge gaps, but missing cognitive operations.
Fewer than 5% of attention heads across all model families function as retrieval heads. They are intrinsic to short-context models, activate dynamically depending on context, and are causally necessary for factuality. Pruning them causes hallucination despite information being present in context.
The Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.