What emergent abilities appear only in truly unified multimodal systems?
This explores what new capabilities show up specifically when a model is genuinely unified across modalities — generating and reasoning over images, video, and text in one system — rather than bolting a vision encoder onto a language model, and whether those 'emergent' abilities are even real.
The corpus has a sharp answer alongside a sharp caveat. The clearest evidence comes from MIO Can a single model generate all modalities without external encoders?, which trains one model on a shared stream of discrete tokens across four modalities. Two abilities appear that dual-modal, encoder-based systems cannot produce: interleaved video-text generation (weaving frames and words into one output) and chain-of-visual-thought, reasoning that unfolds *in images* rather than describing images in words. These aren't better versions of existing skills; they're behaviors that exist only when generation runs in both directions through a single representation.
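To make "a shared stream of discrete tokens" concrete, here is a minimal sketch of one way to place several modalities in a single vocabulary: give each modality's codebook a disjoint id range, then concatenate segments into one sequence. The codebook sizes, modality names, and helper functions are hypothetical illustrations, not MIO's actual tokenizer.

```python
# Hypothetical codebook sizes per modality (illustrative, not from MIO).
CODEBOOK_SIZES = {"text": 1000, "image": 512, "speech": 256, "video": 512}


def build_offsets(sizes):
    """Assign each modality a disjoint id range in one shared vocabulary."""
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = start
        start += size
    return offsets, start  # second value is the total vocabulary size


OFFSETS, VOCAB_SIZE = build_offsets(CODEBOOK_SIZES)


def to_shared(modality, local_ids):
    """Map modality-local token ids into the shared id space."""
    size = CODEBOOK_SIZES[modality]
    for i in local_ids:
        if not 0 <= i < size:
            raise ValueError(f"id {i} is outside the {modality} codebook")
    return [OFFSETS[modality] + i for i in local_ids]


def modality_of(shared_id):
    """Recover which modality a shared token id belongs to."""
    for name, offset in OFFSETS.items():
        if offset <= shared_id < offset + CODEBOOK_SIZES[name]:
            return name
    raise ValueError(f"id {shared_id} is outside the shared vocabulary")


def interleave(segments):
    """Flatten (modality, local_ids) segments into one token stream."""
    stream = []
    for modality, local_ids in segments:
        stream.extend(to_shared(modality, local_ids))
    return stream


stream = interleave([("text", [1, 2]), ("image", [0, 5]), ("text", [3])])
print(stream, [modality_of(t) for t in stream])
```

Because every token lives in the same id space, a single next-token predictor can in principle emit a text token, then an image token, then text again, which is the precondition for interleaved output.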
But why don't more systems get there? The bottleneck turns out to be architectural, not fundamental. Work on modality competition Can we solve modality competition through architectural design? shows that vision and language aren't inherently incompatible — they fight over a model's fixed capacity, and a Mixture-of-Experts design that allocates capacity per token lets them coexist. So 'truly unified' is less a philosophical threshold and more an engineering one: give each modality room and the competition that degrades joint systems dissolves.
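As a rough illustration of "allocating capacity per token", the sketch below routes each token to a single expert chosen by a gating function, so different tokens can use different parameters instead of sharing one dense block. This is a toy top-1 router in plain Python under simplifying assumptions (fixed gate weights, no load balancing, no top-k mixing); the function names are hypothetical and it is not the design from the cited work.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top1(token, gate_weights):
    """Return (expert_index, gate_probability) for one token vector."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


def moe_layer(tokens, gate_weights, experts):
    """Send each token to its top-1 expert and scale by the gate probability."""
    outputs = []
    for token in tokens:
        idx, prob = route_top1(token, gate_weights)
        expert_out = experts[idx](token)
        outputs.append([prob * v for v in expert_out])
    return outputs


# Two toy experts standing in for modality-specialized feed-forward blocks.
experts = [
    lambda t: [2.0 * v for v in t],   # hypothetical "expert 0"
    lambda t: [-1.0 * v for v in t],  # hypothetical "expert 1"
]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]  # one gate row per expert
tokens = [[3.0, 0.0], [0.0, 3.0]]

print([route_top1(t, gate_weights)[0] for t in tokens])
print(moe_layer(tokens, gate_weights, experts))
```

The point of the sketch is only the routing decision: each token selects its own expert, so a vision-like token and a text-like token need not contend for the same weights.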
There's a deeper reason unification unlocks things text alone can't reach. The Plato's-cave argument Are text-only language models fundamentally limited by abstraction? holds that text strips out the physics, geometry, and causality of the world, leaving language models manipulating symbols cut off from their source dynamics, which is exactly where they predictably fail: physical, geometric, and causal reasoning. Multimodal grounding is the proposed fix. The grounding analysis What grounds language understanding in systems without embodiment? sharpens this: models can achieve *functional* grounding through relational language patterns, but causal grounding requires embodied contact with the environment, and closing the remaining gaps takes architectural change, not just more training. Unified multimodal systems are one route toward closing that gap.
Here's the part you didn't know you wanted: the word 'emergent' should make you suspicious. A pointed study Are LLM emergent abilities real or measurement artifacts? shows that many celebrated 'emergent' jumps in language models vanish when you measure with a continuous metric instead of a pass/fail one — the capability was improving smoothly all along; the *measurement* created the cliff. Applied here, that's a discipline: an ability counts as genuinely emergent in a unified system only if it's a behavior the modular alternative cannot produce *at all* (like MIO's visual reasoning), not just a benchmark score that leaps because of how you scored it.
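The measurement effect can be reproduced with a few lines of arithmetic. The sketch below assumes, purely for illustration, that a model gets each token of a 20-token answer right independently with probability p; the continuous metric is p itself, while the pass/fail metric is the chance that all 20 tokens are right, p^20. The independence assumption, the answer length, and the sample values of p are all hypothetical.

```python
ANSWER_LENGTH = 20  # hypothetical answer length in tokens


def exact_match(per_token_accuracy, length=ANSWER_LENGTH):
    """Pass/fail score: probability that every token is correct,
    assuming token errors are independent."""
    return per_token_accuracy ** length


# Per-token accuracy improving steadily, as a stand-in for scale.
per_token_accuracies = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

print("per-token  exact-match")
for p in per_token_accuracies:
    print(f"{p:9.2f}  {exact_match(p):11.4f}")
```

Under these assumptions the per-token column rises steadily while the exact-match column stays near zero for most of the range and then climbs steeply. The same underlying improvement looks like a cliff only under the all-or-nothing score.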
Finally, there's a ceiling worth naming. Even models that excel without unification — predicting human social norms better than people Can AI systems learn social norms without embodied experience? — make identical systematic errors that hint at limits pattern-matching can't cross, and a stronger claim holds that some capacities (consciousness, full causal understanding) require embodied co-presence in a shared world Can disembodied language models ever qualify as conscious?. So unification expands what's reachable — visual reasoning, cross-modal generation, better grounding — but it's a step along the embodiment axis, not the end of it.
Sources 7 notes
MIO trains a foundation model on mixed discrete tokens across four modalities with causal modeling, achieving end-to-end generation in both directions. The model emergently produces interleaved video-text output and chain-of-visual-thought reasoning that dual-modal encoder-based systems cannot.
Modality competition arises from caption distributional shift and rigid dense capacity allocation, not from vision and language being fundamentally incompatible. Mixture of Experts resolves the architectural bottleneck by allocating capacity per token, enabling modalities to coexist without competing.
Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.
Language models achieve functional grounding through relational language patterns but lack social grounding through participatory agency and causal grounding through embodied environmental contact. Social grounding can increase through human integration, but linguistic agency requires architectural changes beyond training.
Sharp, unpredictable capability transitions vanish when using continuous metrics instead of discontinuous ones. The same model outputs show smooth predictable improvement with scale, suggesting emergence is a measurement choice rather than a real behavioral change.
GPT-4.5 predicted appropriateness of 555 social scenarios at the 100th percentile compared to human raters, with Gemini and Claude also exceeding 96% accuracy. However, all models show identical systematic errors, revealing boundaries of pattern-based social understanding that embodied experience may still be necessary to cross.
Current disembodied LLMs cannot be candidates for consciousness because consciousness language originates from and applies only to entities sharing a world with us through co-presence and triangulation on shared objects.