SYNTHESIS NOTE
Model Architecture and Internals

Can a single model generate all modalities without external encoders?

Most multimodal systems rely on separate encoders for each modality. This research explores whether training a unified foundation model on discrete tokens across text, image, video, and speech can enable any-to-any generation without those external components.

Synthesis note · 2026-06-03 · sourced from Multimodal

Most multimodal LLMs are dual-modal (text plus one other modality) and rely on external encoders with alignment modules. They can understand non-text inputs but rarely generate them, and they cannot produce interleaved multimodal sequences. MIO addresses this by training on a mixture of discrete tokens across four modalities (text, image, video, speech) with causal multimodal modeling, in four stages: alignment pretraining, interleaved pretraining, speech-enhanced pretraining, and comprehensive supervised fine-tuning (SFT). The result is end-to-end, autoregressive, any-to-any understanding and generation in one model, with emergent abilities the dual-modal baselines lack: interleaved video-text generation and chain-of-visual-thought reasoning.

The keeper is the design point: tokenize every modality into a shared discrete vocabulary and model the resulting tokens with one causal objective. That is what unlocks interleaved cross-modal output and reasoning that bolted-on encoder approaches cannot produce.
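
To make the design point concrete, here is a minimal sketch of a shared discrete vocabulary and the next-token setup over an interleaved sequence. Everything specific in it is an assumption for illustration: the codebook sizes, the boundary tokens (`<image>`, `<speech>`), and the helper names are hypothetical and are not taken from MIO. Video is left out for brevity, and the real tokenizers are replaced by hand-written local IDs.

```python
# Hypothetical codebook sizes, chosen only for illustration (not MIO's values).
TEXT_VOCAB = 1000
IMAGE_CODES = 512
SPEECH_CODES = 256
SPECIAL = ["<image>", "</image>", "<speech>", "</speech>"]

# Each modality gets its own contiguous ID range inside one shared vocabulary.
OFFSETS = {
    "text": 0,
    "image": TEXT_VOCAB,
    "speech": TEXT_VOCAB + IMAGE_CODES,
}
SPECIAL_BASE = TEXT_VOCAB + IMAGE_CODES + SPEECH_CODES
SPECIAL_IDS = {tok: SPECIAL_BASE + i for i, tok in enumerate(SPECIAL)}
VOCAB_SIZE = SPECIAL_BASE + len(SPECIAL)


def to_shared_ids(modality, local_ids):
    """Shift tokenizer-local IDs into the shared vocabulary.

    Non-text spans are wrapped in boundary tokens so the model can tell
    where a modality starts and ends.
    """
    offset = OFFSETS[modality]
    ids = [offset + i for i in local_ids]
    if modality == "text":
        return ids
    return [SPECIAL_IDS[f"<{modality}>"], *ids, SPECIAL_IDS[f"</{modality}>"]]


def build_sequence(segments):
    """Flatten interleaved (modality, local_ids) segments into one token stream."""
    seq = []
    for modality, local_ids in segments:
        seq.extend(to_shared_ids(modality, local_ids))
    return seq


def causal_pairs(seq):
    """One causal objective for all modalities: predict token t+1 from tokens <= t."""
    return seq[:-1], seq[1:]


# An interleaved example: text, an image span, more text, a speech span.
segments = [("text", [5, 7]), ("image", [3, 0]), ("text", [9]), ("speech", [1])]
sequence = build_sequence(segments)
inputs, targets = causal_pairs(sequence)
```

Because every target is just an ID in the same vocabulary, generating an image or speech span is the same operation as generating text, which is why interleaved output falls out of this setup rather than needing a separate decoder path per modality.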

This sits in the vault's multimodal-architecture thread as the discrete-token, any-to-any design point. It is the tokenizer-based ("Type-D") instantiation, in contrast to encoder-based unified generation. It complements "Why do unified image generators fail on non-Latin scripts?": unified any-to-any generation still inherits the frequency-driven data bias documented in "Does multimodal zero-shot performance actually generalize or interpolate?".

Original note title

an any-to-any foundation model on discrete multimodal tokens enables interleaved generation and chain-of-visual-thought