Beyond Language Modeling: An Exploration of Multimodal Pretraining
The foundation model era has been defined largely by the success of language pretraining. By scaling autoregressive models on trillions of text tokens, we have created systems with remarkable reasoning capabilities. Yet, fundamentally, text is a human abstraction—a lossy compression of reality. To borrow the allegory of Plato's Cave, language models have mastered the description of shadows on the wall without ever seeing the objects casting them. They capture symbols well but miss the high-fidelity physics, geometry, and causality of the physical world. Beyond this philosophical limitation lies a hard, pragmatic ceiling: high-quality text data is finite and approaching exhaustion. In contrast, the visual world possesses an endless stream of signal "outside the cave", capturing the raw dynamics of reality that language misses. As a result, the path forward requires moving beyond the shadows to model the source directly.
The scientific landscape of unified multimodal pretraining remains largely opaque. While recent efforts have begun to move beyond language-only pretraining, the design space is rife with confounding variables. Rather than jointly learning from vision and language from scratch, most current methodologies rely on initialization from pretrained language models. This paradigm prioritizes preserving existing language capabilities while adapting the model to become multimodal. Moreover, the knowledge already embedded in these pretrained backbones confounds any conclusions drawn about the multimodal training itself, making it difficult to disentangle what is learned from unified training versus what is inherited from language pretraining. Consequently, the fundamental dynamics and scaling relationships between vision and language remain poorly understood.
We set out to bring empirical clarity to the design space of unified multimodal pretraining. A central question in this space is whether vision and language can coexist in a single model without mutual degradation. By training from scratch and systematically studying one variable at a time, we find that modality competition is not a fatal flaw of multimodal pretraining—it is just a symptom of specific design choices. Previous work has assumed that separate visual representations are necessary for understanding and generation. Our results show that a single high-dimensional semantic representation (RAE) excels at both visual understanding and generation, and our MoE analysis confirms that the same experts are often activated for both tasks. This convergence was fully learned from the data without any human priors. Furthermore, RAE representations continue to improve as capacity scales, while VAE-based methods exhibit loss saturation, suggesting that semantic representations are better suited as the foundation for scaling unified models.
Modality competition is largely solvable. A common assumption is that vision and language are inherently at odds within a single model, meaning that training on one modality necessarily degrades the other. Our findings paint a different picture: the "modality tax" has two identifiable sources, neither of which is the visual modality itself. The first is data-related: the friction stems from distributional shifts in image-text captions. Pure video, by contrast, is complementary to language, and general multimodal pretraining yields positive transfer for VQA and world modeling. The second is architectural: dense models rigidly allocate capacity between modalities, a limitation that modality-specific FFNs partially address and that MoE further resolves by learning to allocate capacity per token.
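To make the architectural contrast concrete, here is a minimal sketch of the two capacity-allocation schemes mentioned above. It assumes a generic softmax router with top-k selection, which the text does not specify. All names (`modality_ffn`, `route_token`, `moe_layer`, `ROUTER`, `EXPERTS`) and the toy weights are hypothetical illustrations, not the configuration used in this work.

```python
import math

def modality_ffn(token, modality, ffns):
    """Modality-specific FFN: the branch is fixed by the token's modality label."""
    return ffns[modality](token)

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(token, router_weights, k=2):
    """Return [(expert_index, gate)] for the k experts with the highest router score."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]  # renormalize over the chosen experts

def moe_layer(token, router_weights, experts, k=2):
    """MoE: mix the outputs of the selected experts, weighted by their gates."""
    out = [0.0] * len(token)
    for idx, gate in route_token(token, router_weights, k):
        expert_out = experts[idx](token)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out

# Toy setup (assumed): 4 experts that just rescale their input, 2-d tokens.
ROUTER = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
EXPERTS = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

text_like = [2.0, 0.1]     # hypothetical embedding of a text token
image_like = [-0.1, -2.0]  # hypothetical embedding of an image token
text_route = route_token(text_like, ROUTER)
image_route = route_token(image_like, ROUTER)
```

In the first function the split of capacity is decided by hand through the modality label. In the second, the router weights are learned, so which experts serve which tokens, and whether experts are shared across modalities, is an outcome of training rather than a design decision.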
Our IsoFLOP analysis reveals that MoE resolves a fundamental scaling asymmetry: in dense models, language follows Chinchilla-like balanced allocation while vision is significantly more data-hungry, making it impossible to allocate compute optimally for both at once. In sparse MoE models, however, language scaling itself becomes more data-hungry, aligning with vision scaling. This suggests that sparsity plays a deeper role in multimodal models than efficiency alone: it provides the structural flexibility that lets modalities with fundamentally different scaling behaviors coexist.
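For readers unfamiliar with the procedure, the following is a rough sketch of how an IsoFLOP analysis is commonly run, not the pipeline used here. For each fixed compute budget C, models of several sizes N are trained (with the token count implied by the usual approximation C ≈ 6ND), a parabola is fitted to loss against log N, and its minimum gives the compute-optimal size. A power law N_opt ∝ C^a is then fitted across budgets. An exponent near 0.5 corresponds to Chinchilla-like balanced allocation; a smaller exponent means more of the compute should go to data. The loss function, budgets, model sizes, and the exponent 0.4 below are synthetic placeholders chosen only to exercise the fitting code.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def parabola_argmin(xs, ys):
    """Least-squares fit y ~ a*x^2 + b*x + c; return the x at the minimum."""
    mean_x = sum(xs) / len(xs)
    u = [x - mean_x for x in xs]  # center x for numerical stability
    s = [sum(v ** k for v in u) for k in range(5)]  # s[k] = sum of u^k
    t = [sum((v ** k) * y for v, y in zip(u, ys)) for k in range(3)]
    normal = [[s[4], s[3], s[2]], [s[3], s[2], s[1]], [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    d = det3(normal)
    # Cramer's rule for the quadratic and linear coefficients.
    a = det3([[rhs[r]] + normal[r][1:] for r in range(3)]) / d
    b = det3([[normal[r][0], rhs[r], normal[r][2]] for r in range(3)]) / d
    if a <= 0:
        raise ValueError("fitted curve has no interior minimum")
    return mean_x - b / (2 * a)

def loglog_slope(xs, ys):
    """Slope of the least-squares line through (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((p - mx) * (q - my) for p, q in zip(lx, ly))
    den = sum((p - mx) ** 2 for p in lx)
    return num / den

def synthetic_loss(n_params, compute, exponent):
    """Toy loss (assumed): exactly quadratic in log N around N_opt = compute**exponent."""
    return 2.0 + 0.05 * (math.log(n_params) - exponent * math.log(compute)) ** 2

def fit_allocation_exponent(exponent):
    """Recover the exponent a in N_opt ~ C^a from synthetic IsoFLOP curves."""
    budgets = [1e16, 1e17, 1e18, 1e19]      # FLOPs, placeholder values
    sizes = [1e6, 1e7, 1e8, 1e9, 1e10]      # parameter counts, placeholder values
    optima = []
    for c in budgets:
        log_n = [math.log(n) for n in sizes]
        losses = [synthetic_loss(n, c, exponent) for n in sizes]
        optima.append(math.exp(parabola_argmin(log_n, losses)))
    return loglog_slope(budgets, optima)

balanced = fit_allocation_exponent(0.5)      # Chinchilla-like toy modality
data_hungry = fit_allocation_exponent(0.4)   # more data-hungry toy modality
```

When two modalities yield different exponents, as in the two toy calls, no single model size is compute-optimal for both at a given budget, which is the asymmetry described above. Real IsoFLOP curves are noisy and only approximately parabolic, so fitted minima should be bracketed by the sampled model sizes.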
Flow Theory describes an optimal cognitive state where individuals experience deep focus and intrinsic motivation when a task’s difficulty aligns with their skill level. In AI-augmented reasoning, interventions that disrupt the state of cognitive flow can hinder rather than enhance decision-making. This paper proposes a context-aware cognitive augmentation framework that adapts interventions based on three key contextual factors: type, timing, and scale. By leveraging multimodal behavioral cues (e.g., gaze behavior, typing hesitation, interaction speed), AI can dynamically adjust cognitive support to maintain or restore flow. We introduce the concept of cognitive flow, an extension of flow theory in AI-augmented reasoning, where interventions are personalized, adaptive, and minimally intrusive. By shifting from static interventions to context-aware augmentation, our approach ensures that AI systems support deep engagement in complex decision-making and reasoning without disrupting cognitive immersion.