What architectural alternatives can capture compositional structure beyond pooled cosine?
This explores why simple pooled-embedding cosine similarity can't represent order-sensitive, compositional meaning — and which architectural moves the corpus offers as alternatives that preserve structure rather than averaging it away.
The workhorse retrieval setup pools a sequence into one vector and compares with cosine; the question here is what the corpus suggests once you need to capture *structure*, not just topical overlap. The cleanest diagnosis of the problem is geometric: unit-sphere cosine spaces force concepts into linear superposition, which is commutative, so they cannot robustly tell "dog bit man" from "man bit dog" or handle negation, however they are trained (see: Why can't cosine space retrievers distinguish word order?). Pooling is the culprit: collapsing a sequence into a point throws away the non-commutative information that composition depends on. So the question isn't "train a better encoder," it's "stop discarding the structure before you compare."
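The commutativity argument can be made concrete with a minimal sketch. The three-dimensional vectors and the `EMB` table below are made-up values for illustration, and the tokens use static (non-contextual) embeddings so that the pooling step is isolated. A real encoder pools contextual states, so some order information does leak through, but the averaging operation itself remains order-blind.

```python
import math

# Hypothetical static embeddings for a toy vocabulary (values are invented).
EMB = {
    "dog": [0.9, 0.1, 0.0],
    "bit": [0.0, 0.8, 0.6],
    "man": [0.2, 0.1, 0.9],
}

def mean_pool(tokens):
    """Average token vectors into one sentence vector; token order is discarded."""
    dims = len(next(iter(EMB.values())))
    return [sum(EMB[t][i] for t in tokens) / len(tokens) for i in range(dims)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = mean_pool("dog bit man".split())
b = mean_pool("man bit dog".split())
similarity = cosine(a, b)
print(similarity)  # approximately 1.0: the two sentences are indistinguishable
```

Because addition is commutative, any permutation of the same tokens pools to the same point, up to floating-point rounding.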
The first family of alternatives keeps the parts addressable instead of averaging them. That same geometry note points to token-level interaction (late-interaction style matching) or a downstream verification pass as the architectural escape hatches (see: Why can't cosine space retrievers distinguish word order?). This dovetails with evidence that compositional ability already lives in the parts: neural networks naturally decompose compositional tasks into isolated modular subnetworks, where ablating one subroutine affects only its corresponding function, and pretraining sharpens that modularity (see: Do neural networks naturally learn modular compositional structure?). The implication is that the representational structure is *present* but pooling destroys access to it; an architecture that preserves per-part signals and matches at that granularity is recovering something the model already has.
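As a rough sketch of token-level matching, the snippet below scores two sentences with a MaxSim-style sum, the scoring rule used by late-interaction retrievers such as ColBERT. The `contextual_tokens` function is a hypothetical stand-in for a real encoder: it appends half of the previous word's one-hot vector to each token so that token vectors carry a trace of their context. That contextualization is an assumption made for illustration; with purely static token vectors MaxSim would be order-blind too.

```python
import math

VOCAB = ["dog", "bit", "man"]

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in VOCAB]

def contextual_tokens(sentence):
    """Toy encoder (assumption): own one-hot plus 0.5 * previous word's one-hot."""
    words = sentence.split()
    vectors = []
    for i, word in enumerate(words):
        prev = one_hot(words[i - 1]) if i > 0 else [0.0] * len(VOCAB)
        vec = one_hot(word) + [0.5 * x for x in prev]
        norm = math.sqrt(sum(x * x for x in vec))
        vectors.append([x / norm for x in vec])  # unit-normalize each token
    return vectors

def maxsim(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )

query = contextual_tokens("dog bit man")
same_order = maxsim(query, contextual_tokens("dog bit man"))
swapped = maxsim(query, contextual_tokens("man bit dog"))
print(same_order, swapped)  # the swapped sentence scores lower
```

The per-token vectors stay addressable until comparison time, so a mismatch in who did what to whom shows up as weaker token-to-token matches rather than being averaged away.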
There's a sobering counter-current, though. Even full transformers often don't do composition the way we hope: they succeed in-distribution by memorizing computation subgraphs and reduce "reasoning" to linearized subgraph matching, then fail badly on novel combinations (see: Do transformers actually learn systematic compositional reasoning?). A deeper warning follows: identical accuracy can mask fractured internal representations, so a metric that looks fine can hide broken organization that shatters under perturbation or distribution shift (see: Can models be smart without organized internal structure?). Together these say the bar for "captures compositional structure" should be generalization to unseen compositions, not score on a held-in benchmark.
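One way to operationalize "generalization to unseen compositions" is to build the evaluation split so that every part appears in training but some combinations never do. The helper below is a minimal sketch of that idea; the function name `compositional_split` and the verb/modifier vocabulary are hypothetical, not taken from any benchmark in the corpus.

```python
from itertools import product

def compositional_split(left_parts, right_parts, held_out):
    """Split all (left, right) pairs so held-out pairs never appear in training,
    while every individual part still does."""
    held_out = set(held_out)
    all_pairs = list(product(left_parts, right_parts))
    train = [p for p in all_pairs if p not in held_out]
    test = [p for p in all_pairs if p in held_out]
    # Sanity check: each test pair must be novel only as a combination.
    train_left = {a for a, _ in train}
    train_right = {b for _, b in train}
    for a, b in test:
        if a not in train_left or b not in train_right:
            raise ValueError(f"a part of {(a, b)} never appears in training")
    return train, test

verbs = ["jump", "walk", "look"]
modifiers = ["twice", "left", "around"]
train, test = compositional_split(
    verbs, modifiers, held_out=[("jump", "around"), ("look", "twice")]
)
print(len(train), len(test))  # 7 2
```

A score on `test` then measures recombination of known parts, which a held-in benchmark cannot separate from memorized subgraphs.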
Where does real compositional capacity come from architecturally? The corpus offers two divergent answers worth holding side by side. One says scale, not structure: standard MLPs achieve compositional generalization through data and model size alone, provided training covers the combinations, and linear decodability of constituents predicts when it works (see: Can neural networks learn compositional skills without symbolic mechanisms?). The other says depth and recurrence: deep-and-thin models compose abstract concepts across layers better than wide ones at small scale (see: Does depth matter more than width for tiny language models?), and hierarchical dual-recurrence couples slow planning with fast computation to break the fixed-depth complexity ceiling that flat transformers can't escape (see: Can recurrent hierarchies achieve reasoning that transformers cannot?). The tension between emergent-from-coverage and engineered-via-depth is the actual design choice once you leave pooled cosine behind.
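To illustrate what "linear decodability of constituents" means in practice, here is a toy probe. Everything in it is an assumption made for the example: `hidden_state` is a hypothetical additive representation (one axis for shape, two for color), and the probe is a plain perceptron rather than whatever probing method the cited work uses. The probe is trained to read out shape from some shape/color combinations and then checked on a color it never saw.

```python
def hidden_state(shape, color):
    """Hypothetical additive activation: one axis for shape, two for color."""
    color_dirs = {"red": (1.0, 0.0), "blue": (0.0, 1.0), "purple": (1.0, 1.0)}
    shape_value = 1.0 if shape == "square" else -1.0
    return (shape_value,) + color_dirs[color]

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_perceptron(examples, epochs=10):
    """Fit a bias-free linear probe with the classic perceptron update."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in examples:
            if y * score(w, x) <= 0:  # misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def probe_accuracy(w, examples):
    hits = sum(1 for x, y in examples if y * score(w, x) > 0)
    return hits / len(examples)

def labelled(pairs):
    """Label each (shape, color) pair by its shape: +1 square, -1 circle."""
    return [(hidden_state(s, c), 1 if s == "square" else -1) for s, c in pairs]

seen = labelled([("square", "red"), ("circle", "red"),
                 ("square", "blue"), ("circle", "blue")])
unseen = labelled([("square", "purple"), ("circle", "purple")])
w = train_perceptron(seen)
print(w, probe_accuracy(w, seen), probe_accuracy(w, unseen))
```

In this toy setup the probe ends up relying only on the shape axis, so it transfers to the unseen color. When constituents are entangled rather than additively encoded, a linear readout of this kind would be expected to fail on new combinations.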
The most radical reframing is that some compositional structure isn't a representation problem at all but a *generation* problem. Autoregressive decoding can't retract emitted tokens, so it fails at constraint satisfaction the way a single vector fails at word order: the needed primitive is missing from the architecture, which is why bolting on a symbolic solver works (see: Why does autoregressive generation fail at constraint satisfaction?). Read alongside the cosine note, a pattern emerges: pooled embeddings lack the primitive to *represent* order; autoregression lacks the primitive to *revise* structure. The interesting alternatives (late interaction, modular/subnetwork addressing, recurrent depth, hybrid verification, symbolic hand-off) are all ways of restoring a primitive that the convenient default architecture quietly removed.
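The missing "retract" primitive can be sketched on a small constraint problem, 4-queens. The `commit_only` function is only an analogy for committed left-to-right decoding: it takes the first legal option at each step, whereas a language model samples by probability. What it shares with autoregression is the one property that matters here, namely that earlier choices are never revised. The `backtracking` function plays the role of the symbolic solver.

```python
def is_safe(placed, col):
    """True if a queen in the next row at `col` attacks no earlier queen."""
    row = len(placed)
    return all(c != col and abs(c - col) != row - r for r, c in enumerate(placed))

def commit_only(n):
    """Decoding analogue: take the first legal option, never retract."""
    placed = []
    for _ in range(n):
        options = [c for c in range(n) if is_safe(placed, c)]
        if not options:
            return None  # dead end, with no way to revise earlier choices
        placed.append(options[0])
    return placed

def backtracking(n, placed=()):
    """Solver analogue: discard invalid partial assignments and try the next."""
    if len(placed) == n:
        return list(placed)
    for col in range(n):
        if is_safe(placed, col):
            solution = backtracking(n, placed + (col,))
            if solution is not None:
                return solution
    return None  # retract: the caller moves on to its next candidate

print(commit_only(4))   # None
print(backtracking(4))  # [1, 3, 0, 2]
```

Both functions use the same constraint check; the only difference is whether a partial assignment can be thrown away, which is exactly the capability the symbolic hand-off supplies.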
Sources (8 notes)
Unit-sphere cosine spaces force concepts into linear superposition, a commutative structure that cannot robustly represent non-commutative distinctions like "dog bit man" versus "man bit dog." This geometric constraint persists regardless of training procedure and requires architectural alternatives like token-level interaction or downstream verification.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This leaves them vulnerable to perturbation and distribution shift in ways that standard evaluation metrics do not reveal.
Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.
The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.