SYNTHESIS NOTE

Can a single model generate all modalities without external encoders?

Most multimodal systems rely on separate encoders for each modality. This research explores whether training a unified foundation model on discrete tokens across text, image, video, and speech can enable any-to-any generation without those external components.

Synthesis note · 2026-06-03 · sourced from Multimodal

Most multimodal LLMs are dual-modal (text plus one other modality) and rely on external encoders with alignment modules. They understand non-text inputs but rarely generate them, and they cannot produce interleaved multimodal sequences. MIO addresses this by training on a mixture of discrete tokens across four modalities (text, image, video, speech) with causal multimodal modeling, through a four-stage process: alignment pretraining → interleaved pretraining → speech-enhanced pretraining → comprehensive SFT. The result is end-to-end, autoregressive, any-to-any understanding and generation in one model, with emergent abilities the dual-modal baselines lack, including interleaved video-text generation and chain-of-visual-thought reasoning.

The keeper is the design point: tokenize every modality into a shared discrete vocabulary and model the combined sequence with one causal objective. That is what unlocks interleaved cross-modal output and reasoning that bolted-on encoder approaches can't produce.
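To make the design point concrete, here is a minimal sketch of a shared discrete vocabulary and the next-token pairs a causal objective would train on. It is an illustration under stated assumptions, not MIO's implementation: the vocabulary and codebook sizes, the offset layout, and every function name (`build_shared_vocab`, `to_shared`, `interleave`, `next_token_pairs`, `modality_of`) are hypothetical, and the per-modality tokenizers are stubbed out as plain lists of local ids.

```python
from dataclasses import dataclass

# Hypothetical sizes, chosen only for illustration (not taken from MIO).
TEXT_VOCAB = 1000
CODEBOOKS = {"image": 256, "video": 256, "speech": 128}


@dataclass(frozen=True)
class SharedVocab:
    offsets: dict  # modality name -> first id of its range in the shared vocabulary
    size: int      # total number of ids across all modalities


def build_shared_vocab(text_vocab, codebooks):
    """Give each modality a contiguous, non-overlapping id range."""
    offsets = {"text": 0}
    next_id = text_vocab
    for modality, n_codes in codebooks.items():
        offsets[modality] = next_id
        next_id += n_codes
    return SharedVocab(offsets=offsets, size=next_id)


def to_shared(vocab, modality, local_ids):
    """Map a tokenizer's local ids into the shared id space."""
    base = vocab.offsets[modality]
    return [base + i for i in local_ids]


def interleave(vocab, segments):
    """Flatten (modality, local_ids) segments into one token sequence."""
    seq = []
    for modality, local_ids in segments:
        seq.extend(to_shared(vocab, modality, local_ids))
    return seq


def next_token_pairs(seq):
    """(input, target) pairs for a causal objective: predict token t+1 from token t's prefix."""
    return list(zip(seq[:-1], seq[1:]))


def modality_of(vocab, token_id):
    """Recover which modality a shared id belongs to (used when decoding output)."""
    if not 0 <= token_id < vocab.size:
        raise ValueError(f"token id {token_id} is outside the shared vocabulary")
    candidates = [m for m, base in vocab.offsets.items() if base <= token_id]
    return max(candidates, key=lambda m: vocab.offsets[m])


vocab = build_shared_vocab(TEXT_VOCAB, CODEBOOKS)
seq = interleave(vocab, [("text", [5, 7]), ("image", [0, 3]), ("speech", [1])])
pairs = next_token_pairs(seq)
```

Because every modality lives in the same id space, the pair that crosses from the last text token to the first image token is trained exactly like any other pair. In this sketch, that is why interleaved output needs no separate decoder or alignment module: the model just keeps predicting the next id, and `modality_of` tells the caller which detokenizer to hand it to.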

This sits in the vault's multimodal-architecture thread as the discrete-token, any-to-any design point. It is the tokenizer-based ("Type-D") instantiation, contrasted with encoder-based unified generation. It complements the note "Why do unified image generators fail on non-Latin scripts?": unified any-to-any generation still inherits the frequency-driven data bias that the note "Does multimodal zero-shot performance actually generalize or interpolate?" documents.

Related concepts in this collection 2

Why do unified image generators fail on non-Latin scripts?
GPT-4o excels at multimodal generation across 20+ tasks, but systematically fails to render non-Latin scripts and underrepresented cultures accurately. What explains this specific failure mode in otherwise capable systems?
any-to-any generation still inherits data-distribution bias

Does multimodal zero-shot performance actually generalize or interpolate?
Explores whether multimodal models like CLIP truly generalize to unseen concepts or whether their impressive performance merely reflects memorization of frequently-seen concepts during pretraining.
the frequency law that bounds even unified token-based multimodal models
