Can a single model generate all modalities without external encoders?
Most multimodal systems rely on separate encoders for each modality. This research explores whether training a unified foundation model on discrete tokens across text, image, video, and speech can enable any-to-any generation without those external components.
Most multimodal LLMs are dual-modal (text plus one other modality) and rely on external encoders with alignment modules. They can understand non-text inputs but rarely generate them, and they cannot produce interleaved multimodal sequences. MIO addresses this by training on a mixture of discrete tokens across four modalities (text, image, video, speech) with causal multimodal modeling, in four stages: alignment pretraining, interleaved pretraining, speech-enhanced pretraining, and comprehensive supervised fine-tuning (SFT). The result is end-to-end, autoregressive, any-to-any understanding and generation in one model, with emergent abilities the dual-modal baselines lack: interleaved video-text generation and chain-of-visual-thought reasoning.
The key takeaway is the design point: tokenize every modality into a shared discrete vocabulary and model them all with one causal objective. That is what unlocks interleaved cross-modal output and reasoning, which bolt-on encoder approaches can't produce.
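As a rough illustration of that design point, the sketch below maps per-modality codebook indices into one shared ID space, interleaves segments into a single sequence, and forms next-token (causal) input/target pairs. It is a minimal sketch under stated assumptions, not MIO's actual tokenizers or training code: the codebook sizes, the `<image>`-style boundary markers, and every function name here are hypothetical.

```python
# Hypothetical codebook sizes, chosen only for illustration (not from the MIO paper).
MODALITIES = ["text", "image", "video", "speech"]
CODEBOOK_SIZES = {"text": 1000, "image": 512, "video": 512, "speech": 256}


def build_vocab(sizes):
    """Give each modality a contiguous ID range, then append boundary markers."""
    offsets, next_id = {}, 0
    for name in MODALITIES:
        offsets[name] = next_id
        next_id += sizes[name]
    specials = {}
    for name in MODALITIES:
        if name == "text":
            continue  # text needs no boundary markers in this sketch
        specials[f"<{name}>"] = next_id
        specials[f"</{name}>"] = next_id + 1
        next_id += 2
    return offsets, specials, next_id


OFFSETS, SPECIALS, VOCAB_SIZE = build_vocab(CODEBOOK_SIZES)


def to_shared_ids(modality, local_ids):
    """Shift a modality's local codebook indices into the shared vocabulary."""
    size = CODEBOOK_SIZES[modality]
    if any(i < 0 or i >= size for i in local_ids):
        raise ValueError(f"index out of range for {modality} codebook")
    ids = [OFFSETS[modality] + i for i in local_ids]
    if modality == "text":
        return ids
    return [SPECIALS[f"<{modality}>"]] + ids + [SPECIALS[f"</{modality}>"]]


def interleave(segments):
    """Concatenate (modality, local_ids) segments into one token sequence."""
    seq = []
    for modality, local_ids in segments:
        seq.extend(to_shared_ids(modality, local_ids))
    return seq


def causal_pairs(seq):
    """Next-token prediction: the target at step t is the token at step t + 1."""
    return seq[:-1], seq[1:]


if __name__ == "__main__":
    seq = interleave([("text", [5, 6]), ("image", [0, 1]), ("text", [7])])
    inputs, targets = causal_pairs(seq)
    print(VOCAB_SIZE, seq)
    print(list(zip(inputs, targets)))
```

Because every modality lives in the same ID space, a single softmax over `VOCAB_SIZE` can emit a text token at one step and an image or speech token at the next, which is what makes interleaved output possible without a separate decoder per modality.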
This sits in the vault's multimodal-architecture thread as the discrete-token, any-to-any design point. It is the tokenizer-based ("Type-D") instantiation, in contrast to encoder-based unified generation. It complements "Why do unified image generators fail on non-Latin scripts?": unified any-to-any generation still inherits the frequency-driven data bias documented in "Does multimodal zero-shot performance actually generalize or interpolate?".
Inquiring lines that use this note as a source (4)
This note is a source for these synthesized inquiries.
- Do discrete tokenized modalities preserve information better than continuous embeddings?
- What emergent abilities appear only in truly unified multimodal systems?
- How does causal multimodal modeling differ from encoder-decoder architectures?
- Can this whole-artifact principle apply to other generative tasks?
Related concepts in this collection (2)
- Why do unified image generators fail on non-Latin scripts?
  GPT-4o excels at multimodal generation across 20+ tasks, but systematically fails to render non-Latin scripts and underrepresented cultures accurately. What explains this specific failure mode in otherwise capable systems?
  any-to-any generation still inherits data-distribution bias
- Does multimodal zero-shot performance actually generalize or interpolate?
  Explores whether multimodal models like CLIP truly generalize to unseen concepts or whether their impressive performance merely reflects memorization of frequently-seen concepts during pretraining.
  the frequency law that bounds even unified token-based multimodal models
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- MIO: A Foundation Model on Multimodal Tokens
- The Evolution of Multimodal Model Architectures
- Beyond Language Modeling: An Exploration of Multimodal Pretraining
- Emerging Properties in Unified Multimodal Pretraining
- Learn from your own latents and not from tokens: A sample-complexity theory
- A Survey on Diffusion Language Models
- MM-LLMs: Recent Advances in MultiModal Large Language Models
- No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance
Original note title
an any-to-any foundation model on discrete multimodal tokens enables interleaved generation and chain-of-visual-thought