What is selective resonance and why do transformers not perform it?
This explores 'selective resonance' — the idea that understanding meaning requires suppressing irrelevant interpretations and letting the right frame ring out, and why the transformer's parallel attention mechanism does the opposite.
This explores a single sharp claim from the corpus: that human comprehension works by *selective resonance* (when you read a pun or a setup line, your mind quietly silences the meanings that don't fit and lets one frame light up) and that transformers structurally can't do this. The closest the collection comes to defining it directly is the finding in "Why do AI systems miss jokes and wordplay so consistently?" that AI integrates tokens through *weighted parallel aggregation* rather than selective suppression. In plain terms: a transformer adds up the contributions of all the words at once, dialing each up or down, but it never fully mutes the wrong reading. Resonance is subtractive: it kills the irrelevant so the relevant stands out. Attention is additive: it blends everything. That difference is offered not as a knowledge gap but as a missing cognitive operation, which is why jokes, wordplay, and frame-dependent meaning fail so consistently regardless of model size.
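The additive-versus-subtractive contrast can be made concrete with a toy calculation. The sketch below is an illustration only, not a model of any real transformer: it assumes three competing readings of an ambiguous word, each reduced to a single made-up number, with made-up relevance scores. Softmax is the normalization used in standard attention; the `blend` and `select` helpers and all values are hypothetical names and data introduced here.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blend(weights, values):
    """Additive aggregation: every reading contributes in proportion to its weight."""
    return sum(w * v for w, v in zip(weights, values))


def select(scores, values):
    """Subtractive selection: keep the top-scoring reading and mute all others."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return values[best]


# Hypothetical toy data: three readings of one ambiguous word.
scores = [2.0, 1.0, 0.5]      # assumed relevance of each reading
readings = [1.0, -1.0, 0.0]   # assumed one-number stand-in for each reading

weights = softmax(scores)
blended = blend(weights, readings)
selected = select(scores, readings)

print("softmax weights:", weights)   # every weight is strictly positive
print("blended reading:", blended)   # pulled toward the losing readings
print("selected reading:", selected)
```

For finite scores every softmax weight stays above zero, so the blended value is always pulled somewhat toward the readings that lost. A large enough score gap can push those weights very close to zero, so the contrast drawn here is between soft down-weighting and outright removal.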
What makes this interesting is how well it rhymes with other corpus findings about what transformers are actually doing when they look like they understand. Several notes converge on the same underlying picture: the model isn't building meaning, it's matching patterns it has already seen. Work on compositional reasoning ("Do transformers actually learn systematic compositional reasoning?") shows transformers succeed by memorizing computation subgraphs from training rather than applying systematic rules, and collapse on novel combinations. The world-models probe ("Do foundation models learn world models or task-specific shortcuts?") shows the same thing from another angle: foundation models trained on orbital mechanics or games develop slice-dependent heuristics, not a unified understanding of structure. Both are what you'd expect from a system that aggregates rather than selectively resonates: it can interpolate across familiar territory but has no mechanism to *choose* one coherent interpretation and discard the rest.
The lateral payoff is in the architectural notes, because they suggest the absence of selective resonance might be less an inherent limit and more a consequence of the flat, fixed-depth design. Multi-hop reasoning, when it does emerge, shows a cosine-clustering signature ("How do transformers learn to reason across multiple steps?"): entity representations literally separating into groups, a faint hint of frames forming under pressure. And approaches that break the flat-aggregation mold do better at exactly the kinds of structured tasks plain transformers fumble: explicit stack tracking makes syntactic generalization on recursive syntax 3-5x more sample-efficient ("Can explicit stack tracking improve how transformers learn recursive syntax?"), while recurrent and hierarchical depth lets models escape the complexity ceiling that constrains fixed-depth attention ("Can recurrent hierarchies achieve reasoning that transformers cannot?"). None of these is 'selective resonance' by name, but each adds the missing ingredient: a mechanism that commits to a structured state instead of averaging over all possibilities.
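One way to picture the cosine-clustering signature is to compare how similar representations are within a group against how similar they are across groups. The notes do not say how the clustering was measured, so the sketch below is an assumed, minimal version: the 3-dimensional "entity vectors", the two groups, and the `mean_pairwise` helper are all hypothetical and chosen only to show the calculation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise(group_a, group_b, same_group=False):
    """Average cosine similarity over pairs drawn from two groups.

    With same_group=True, only distinct unordered pairs inside one group
    are counted, so a vector is never compared with itself.
    """
    sims = []
    for i, u in enumerate(group_a):
        for j, v in enumerate(group_b):
            if same_group and j <= i:
                continue
            sims.append(cosine(u, v))
    return sum(sims) / len(sims)


# Hypothetical entity representations: two groups of toy 3-d vectors.
group_x = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.1]]
group_y = [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.1, 1.0, 0.2]]

within_x = mean_pairwise(group_x, group_x, same_group=True)
within_y = mean_pairwise(group_y, group_y, same_group=True)
between = mean_pairwise(group_x, group_y)

# A positive gap means vectors sit closer to their own group than to the other.
separation = (within_x + within_y) / 2 - between

print("within-group similarity:", within_x, within_y)
print("between-group similarity:", between)
print("separation:", separation)
```

In this toy setup a large positive `separation` is what "representations separating into groups" would look like numerically; a value near zero would mean the groups are blended together.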
The thing you may not have known you wanted to know: the transformer's signature strength — attending to everything in parallel — is the very property that makes resonance impossible. Resonance requires *not* attending to most things. So the failure on a joke isn't a quirk of training data; it's the flip side of the architecture that makes transformers so good at fluent, broad-context blending in the first place.
Sources (6 notes)
Transformers integrate token information through weighted parallel aggregation rather than selective suppression of irrelevant words. This structural difference explains consistent failures with jokes, wordplay, and frame-dependent meaning—not knowledge gaps, but missing cognitive operations.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Inductive bias probes show transformers trained on orbital mechanics and games learn predictive patterns, not unified world structure. Fine-tuning reveals nonsensical, slice-dependent laws; circuit analysis shows arithmetic relies on range-matching heuristics, not algorithms.
Controlled training reveals transformers learn multi-hop reasoning in three phases: memorization, in-distribution generalization, and cross-distribution reasoning. Successful reasoning correlates with cosine clustering of entity representations, and second-hop generalization requires explicit compositional exposure during training.
Pushdown Layers—a drop-in self-attention replacement with explicit stack tracking—achieve 3-5x more sample-efficient syntactic generalization while maintaining perplexity. The improvement shows that recursive structure specifically benefits from architectural inductive bias despite general compositional generalization emerging from scale.
The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.