Do decoder-only models have inherent architectural limits for non-sequential information?

This explores whether the left-to-right, causal design of decoder-only models structurally prevents them from handling information that isn't inherently ordered — and the corpus suggests the honest answer is 'partly, but the limit is more removable than it looks.'

This reads the question as asking whether the causal, left-to-right machinery of decoder-only models is a hard wall against non-sequential information, and the corpus splits into three answers that are worth holding together. The most direct evidence says yes, there's a real architectural tax: causal attention lets every token attend only to itself and earlier tokens, which cripples these models when they're asked to build a holistic representation of a whole input. The note "Why do decoder-only models underperform as text encoders?" pins the bottleneck precisely: it's the causal masking, not model size, that makes decoder-only models weak text encoders. But the same note is also the escape hatch: enabling bidirectional attention, followed by light adaptation (masked prediction and contrastive learning), turns those same models into state-of-the-art encoders. So the 'limit' is a configuration choice baked into how the model reads, not an irreversible property of the weights.
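The masking difference can be made concrete with a minimal sketch. It assumes single-head scaled dot-product attention with no learned projections; the function name `attention` and the toy vectors are illustrative only and are not taken from the notes or from LLM2Vec. The only thing that changes between the two calls is whether position i may look at positions after i.

```python
import math


def attention(queries, keys, values, causal):
    """Toy single-head scaled dot-product attention over lists of vectors.

    If causal is True, position i may only attend to positions j <= i.
    Returns (outputs, weights), where weights[i][j] is how much token i
    attends to token j.
    """
    n, d = len(queries), len(queries[0])
    outputs, all_weights = [], []
    for i in range(n):
        scores = []
        for j in range(n):
            if causal and j > i:
                # Future positions are masked out before the softmax.
                scores.append(float("-inf"))
            else:
                dot = sum(q * k for q, k in zip(queries[i], keys[j]))
                scores.append(dot / math.sqrt(d))
        # Numerically stable softmax; exp(-inf) evaluates to 0.0.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        all_weights.append(weights)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs, all_weights


# Three hypothetical token embeddings, used as queries, keys and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

causal_out, causal_w = attention(x, x, x, causal=True)
bidi_out, bidi_w = attention(x, x, x, causal=False)

print("causal weights for token 0:       ", causal_w[0])
print("bidirectional weights for token 0:", bidi_w[0])
```

Under the causal mask the first token can only see itself, so its representation carries no information about the rest of the input; only the last position sees everything. With the mask removed, every position mixes in the whole sequence, which is the property a text encoder needs.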

There's a deeper, less negotiable version of the limit when the task is reasoning rather than encoding. The note "Can recurrent hierarchies achieve reasoning that transformers cannot?" shows that fixed-depth transformers sit under a genuine complexity ceiling (the AC0/TC0 complexity classes), which is why chain-of-thought collapses on tightly interdependent puzzles like Sudoku and mazes: problems where the answer can't be assembled by sweeping left to right but requires holding the whole board in mind and iterating. A recurrent, hierarchical model escapes that ceiling with only 27M parameters. That's the strongest case that something about the standard architecture, not just its training, struggles with information whose structure is simultaneous rather than sequential.
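A toy analogy may help with the intuition here. The sketch below is not a model of a transformer or of the Hierarchical Reasoning Model, and it proves nothing about complexity classes; it only illustrates, under that loud assumption, why a fixed number of local update rounds can fall short on a maze while iterating to convergence does not. The names `propagate`, `MAZE`, `START` and `GOAL` are hypothetical.

```python
def propagate(grid, start, rounds=None):
    """Spread a 'reachable' flag to open neighbouring cells, one step per round.

    rounds=None means keep iterating until nothing changes (a fixed point).
    A finite rounds value stands in for a fixed computation depth.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    reached = {start}
    done = 0
    while rounds is None or done < rounds:
        frontier = set()
        for (r, c) in reached:
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < n_rows and 0 <= nc < n_cols
                if inside and grid[nr][nc] == "." and (nr, nc) not in reached:
                    frontier.add((nr, nc))
        if not frontier:
            break  # converged: no new cells became reachable
        reached |= frontier
        done += 1
    return reached


# A serpentine maze: '.' is open, '#' is a wall.
MAZE = [
    ".....",
    "####.",
    ".....",
    ".####",
    ".....",
]
START, GOAL = (0, 0), (4, 4)

shallow = propagate(MAZE, START, rounds=4)   # fixed, small depth
converged = propagate(MAZE, START)           # iterate until stable

print("goal found with 4 rounds:   ", GOAL in shallow)
print("goal found after converging:", GOAL in converged)
```

The number of rounds needed grows with the length of the winding path, so any fixed budget can be defeated by a longer maze, whereas the iterate-until-stable version adapts its depth to the instance.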

And yet the corpus refuses to let 'inherent' stand unchallenged. The note "Can a single transformer become universally programmable through prompts?" proves that a single finite transformer exists that can compute any computable function given the right prompt, meaning the architecture is not formally bounded at all. The catch is that standard training rarely produces models that actually implement such programs. This reframes the whole question: the wall you hit in practice is usually a training-regime wall wearing an architecture costume, an idea echoed by "Can non-reasoning models catch up with more compute?", where what looks like a capability ceiling turns out to be about how the model was trained to deploy its compute.

It is also worth knowing that the field is routing around the sequential bias rather than waiting for a new architecture. "Can neural memory modules scale language models beyond attention limits?" bolts a separate long-term memory onto attention so the model isn't forced to re-derive everything from a linear sweep, and "Is long-context bottleneck really about memory or compute?" argues the real long-context constraint is the compute needed to consolidate scattered context into internal state: again a processing limit, not a storage one. Step back further and "Are text-only language models fundamentally limited by abstraction?" suggests the most stubborn 'non-sequential' problem isn't even attention shape: text itself has already flattened the geometry, physics, and causal structure of the world into a symbol stream, so the model never receives the non-sequential information in the first place.

The thing you didn't know you wanted to know: 'decoder-only' bundles together at least three separable limits — a removable encoding limit (causal masking), a real but escapable reasoning-depth limit (fixed-depth complexity classes), and a non-limit in principle that becomes a limit only through how we train. The architecture is less destiny than it appears; what's sequential is mostly the habit, not the hardware.

Sources 7 notes

Why do decoder-only models underperform as text encoders?

LLM2Vec's unsupervised 3-step process (bidirectional attention + masked prediction + contrastive learning) achieves SOTA on MTEB. The research shows causal masking, not model size, is the representation bottleneck in decoder-only encoders.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Can a single transformer become universally programmable through prompts?

Research proves a single finite-size transformer exists that can compute any computable function given the right prompt, achieving complexity bounds nearly matching unbounded models. However, standard training rarely produces models that learn to implement arbitrary programs this way.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can neural memory modules scale language models beyond attention limits?

The Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Are text-only language models fundamentally limited by abstraction?

Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.
