
Are text-only language models fundamentally limited by abstraction?

Explores whether text's compression of physics, geometry, and causality into symbols creates an irreducible ceiling for language-only AI, and whether multimodal approaches can overcome this structural constraint.

Note · 2026-05-18 · sourced from Multimodal

The foundation-model era was defined by language pretraining: trillions of text tokens, autoregressive objectives, and capabilities that surprised the field. The argument in Beyond Language Modeling is that this strategy has reached a structural ceiling, not for reasons of compute or data quantity but because of what text is.

Text is a human abstraction. When humans describe the world, we compress continuous physics into discrete symbols, and that compression is lossy by construction. The high-fidelity physics, geometry, and causality that govern reality are stripped in the encoding. A language model trained on text inherits the abstraction's limits: it can manipulate symbols brilliantly without grounding them in the dynamics those symbols describe. To borrow the allegory of Plato's cave, text-only LLMs have mastered the descriptions of shadows on the wall without ever seeing the objects casting them.

The metaphor is doing real work, not just framing. It identifies a specific failure category: tasks that require reasoning about the source rather than the description. Physical reasoning about object interactions. Geometric reasoning about spatial relationships that text under-specifies. Causal reasoning about why something happens rather than what is described as happening. These are the task clusters on which text-only LLMs persistently underperform, and the cave allegory predicts they should.

Beyond philosophy lies a hard pragmatic ceiling: high-quality text data is finite and approaching exhaustion. The compute side of the scaling curve has runway; the data side does not. The path forward requires moving beyond the shadows and modeling the source directly. Visual data preserves the physics, geometry, and causality that language strips, and the visual world's signal is essentially endless.

This reframes multimodal pretraining not as a mere addition to language pretraining but as the correction of an abstraction-induced limit. The text-only era was always going to hit this wall. The question is whether multimodal architectures can integrate the unfiltered signal without inheriting the limitations of how vision and language were previously combined.

Original note title

text-only LLMs are Plato cave models — text is a lossy human abstraction that captures shadows while missing physics geometry and causality of the source