What attentional bias objectives compete with dot product similarity for associative memory?

This note explores which alternatives to the plain query-key dot product decide what an attention-style associative memory stores and retrieves (surprise, repetition prominence, or learned similarity functions) and how those objectives stack up.

Standard transformer attention is, at bottom, a dot-product associative memory: a query is matched against keys by inner product, and the values attached to the highest-scoring keys get pulled forward. The corpus has several notes that, read together, show different objectives competing to govern that matching, and they don't all optimize for the same thing.
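The dot-product matching described above can be sketched in a few lines. This is a minimal, single-query illustration in plain Python; the helper names (`softmax`, `dot`, `attend`) and the toy keys and values are assumptions made for the example, and it omits the scaling, batching, and multiple heads of a real transformer layer.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Blend the values, weighting each one by softmax(query . key)."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, blended

# Toy memory: three key/value pairs (hypothetical data).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]

# The query points along the first key, so the first value dominates the output.
weights, blended = attend([4.0, 0.0], keys, values)
```

Retrieval here is soft: every value contributes, but the key most aligned with the query receives nearly all of the weight.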

The first competing objective is **surprise** rather than similarity. The Titans architecture (see "Can neural memory modules scale language models beyond attention limits?") splits short-term attention (quadratic, dot-product based) from a long-term neural memory that prioritizes *surprising* tokens for storage. Instead of asking "what is most similar to my query," the memory asks "what violated my expectations enough to be worth keeping": a gradient-based surprise signal, not a dot product. That's a fundamentally different write objective, and it's what lets the model stretch past 2M tokens of context without paying attention's quadratic cost.
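A surprise-gated write can be sketched as follows. The notes say only that the memory prioritizes surprising tokens; the linear memory, the squared-error loss, the hard threshold, and the learning rate below are illustrative assumptions, not the Titans implementation.

```python
import math

def predict(memory, key):
    # Linear memory: one row per output dimension.
    return [sum(m * k for m, k in zip(row, key)) for row in memory]

def surprise_write(memory, key, value, lr=0.5, threshold=0.1):
    """Update memory in place only when the prediction error is large.

    Surprise is the norm of the error (M k - v). The update is one
    gradient step on 0.5 * ||M k - v||^2, whose gradient with respect
    to M is the outer product of the error and the key.
    """
    error = [p - v for p, v in zip(predict(memory, key), value)]
    surprise = math.sqrt(sum(e * e for e in error))
    if surprise > threshold:
        for i, e in enumerate(error):
            for j, k in enumerate(key):
                memory[i][j] -= lr * e * k
    return surprise

# Present the same (key, value) pair repeatedly (hypothetical data).
memory = [[0.0, 0.0]]
surprises = [surprise_write(memory, [1.0, 0.0], [1.0]) for _ in range(6)]
```

The first presentation is maximally surprising and triggers a large write. Each repeat is less surprising, and once the error falls below the threshold the memory stops changing. Nothing in this write rule compares a query to a key.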

The second is a **structural prominence bias** baked into soft attention itself. The note on attention's bias toward repeated content ("Does transformer attention architecture inherently favor repeated content?") shows that soft attention doesn't weight purely by relevance: it systematically over-weights tokens that are repeated or context-prominent, creating a feedback loop that amplifies framing regardless of whether it answers the query. So even within dot-product attention, there's a hidden objective (prominence) riding alongside similarity, which is part of why sycophancy emerges. "System 2 Attention", which regenerates the context to strip irrelevant material, is essentially an attempt to subtract that competing bias out.
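One simple way repetition can translate into attention mass is that softmax normalizes over positions, so every extra copy of a token adds its own share. The toy calculation below illustrates that mechanism only; the labels and scores are made up for the example and are not measurements from the note.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mass_by_label(labels, scores):
    """Sum the softmax weight received by each distinct label."""
    totals = {}
    for label, w in zip(labels, softmax(scores)):
        totals[label] = totals.get(label, 0.0) + w
    return totals

# One copy each: the more relevant "answer" token (score 2.0) wins.
once = mass_by_label(["answer", "framing"], [2.0, 1.0])

# Repeat the less relevant "framing" token five times at the same score.
repeated = mass_by_label(["answer"] + ["framing"] * 5, [2.0] + [1.0] * 5)
```

With a single copy of each token, the answer token holds most of the weight. After the framing token is repeated five times, its copies jointly hold more weight than the answer token, even though no individual copy scores higher.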

The third is the **learned-similarity-function** contest, and here the corpus is unusually direct: dot product wins. Two notes on Rendle et al. ("Can MLPs learn to match dot product similarity in practice?" and "Why does dot product beat MLP-based similarity in practice?") show that, in collaborative filtering, replacing the inner product with an MLP that *learns* its own similarity metric underperforms a well-tuned dot product, despite the MLP being a universal approximator. The inductive bias of geometric similarity beats raw expressiveness, and the dot product also survives where the MLP can't: of the two, only the dot product supports efficient maximum-inner-product search (MIPS) at scale. So the objective that competes hardest on paper, a freely learned similarity, loses on both accuracy and retrievability.
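The retrievability point can be made concrete with an exhaustive maximum-inner-product search. This is only a sketch: the item vectors and the function name `top_k_inner_product` are hypothetical, and production systems would use an approximate MIPS index rather than a full scan.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_inner_product(query, items, k):
    """Return the k (item_id, score) pairs with the largest inner product."""
    scored = ((item_id, dot(query, vec)) for item_id, vec in items.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Item embeddings are computed once and stored (hypothetical data).
items = {
    "a": [1.0, 0.0],
    "b": [0.5, 0.5],
    "c": [-1.0, 2.0],
}

top = top_k_inner_product([2.0, 1.0], items, k=2)
```

Because the score is an inner product between a query vector and a fixed item vector, the item side can be precomputed and indexed. A learned MLP score takes the query and the item together as input, so in general each candidate pair has to be passed through the network.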

The through-line a curious reader might not expect: the dot product isn't winning because it's the most powerful matcher, but because the alternatives each trade something away. Surprise-based memory gives up similarity for compression and reach; prominence bias is an *unwanted* objective attention can't help but optimize; and learned MLP similarity gives up the geometric structure that makes retrieval tractable. Associative memory is less a search for the best similarity score than a negotiation among these competing pressures.

Sources 4 notes

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Does transformer attention architecture inherently favor repeated content?

Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.

Can MLPs learn to match dot product similarity in practice?

Rendle et al. show that carefully tuned dot products substantially outperform learned MLP similarities in collaborative filtering. MLPs require excessive capacity and data to match simple geometric similarity, and they cannot be efficiently retrieved at scale—proving inductive bias matters more than expressiveness.

Why does dot product beat MLP-based similarity in practice?

Rendle et al. show properly-tuned dot products substantially beat MLP-based similarity despite MLP universality. Learning a dot product with an MLP requires large models and datasets; dot products also enable efficient retrieval at production scale through MIPS algorithms.
