Large Language Diffusion Models

Paper · arXiv 2502.09992 · Published February 14, 2025
Diffusion LLM · Novel Architectures

Is the autoregressive paradigm the only viable path to achieving the intelligence exhibited by LLMs?

We argue that scalability is primarily a consequence of the interplay between Transformers (Vaswani, 2017), model and data size, and Fisher consistency (Fisher, 1922) induced by the generative principles in Eq. (1), rather than a unique result of ARMs.
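For context, the "Eq. (1)" referenced here is the paper's generic maximum-likelihood formulation, which ARMs and diffusion models optimize alike; equivalently, minimizing the KL divergence between the data and model distributions:

$$
\max_{\theta}\; \mathbb{E}_{p_{\text{data}}(x)}\bigl[\log p_{\theta}(x)\bigr]
\;\Longleftrightarrow\;
\min_{\theta}\; \mathrm{KL}\bigl(p_{\text{data}}(x)\,\|\,p_{\theta}(x)\bigr)
$$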

Specifically, we construct a dataset of 496 sentence pairs from famous Chinese poems. Given a sentence from a poem, models are tasked with generating the subsequent line (forward) or the preceding line (reversal) without additional fine-tuning; a minimal sketch of this probe follows.
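As an illustration of how such a forward/reversal probe could be scored, here is a minimal sketch. The prompt templates, the `model_generate` callable, and the exact-match scoring rule are assumptions for illustration, not the paper's actual evaluation harness:

```python
POEM_PAIRS = [
    # (first_line, second_line); the paper's dataset contains 496 such pairs
    ("first line of a couplet", "second line of a couplet"),
]

def make_prompt(line: str, direction: str) -> str:
    # Hypothetical prompt templates; the paper does not publish its exact wording.
    if direction == "forward":
        return f'Given the poem line "{line}", write the line that follows it.'
    return f'Given the poem line "{line}", write the line that precedes it.'

def pair_accuracy(model_generate, pairs, direction: str) -> float:
    """Zero-shot exact-match accuracy: prompting only, no fine-tuning."""
    hits = 0
    for first, second in pairs:
        # Forward: prompt with the first line, expect the second; reversal: the opposite.
        src, tgt = (first, second) if direction == "forward" else (second, first)
        hits += int(model_generate(make_prompt(src, direction)).strip() == tgt)
    return hits / len(pairs)
```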

First, it demonstrates LLaDA’s ability to generate coherent, fluent, and extended text in a non-autoregressive manner. Second, it highlights the model’s multi-turn dialogue capability, effectively retaining conversation history and producing contextually appropriate responses across multiple languages. Such chat capabilities are impressive, as LLaDA is, to the best of our knowledge, the first model of its kind to depart from conventional ARMs.
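To make "non-autoregressive" concrete, below is a minimal sketch of the kind of iterative unmasking loop such a masked diffusion model runs at inference time: every response position starts masked, and each step commits the highest-confidence predictions instead of appending tokens left to right. The `model` callable, the linear commit schedule, and the confidence-based remasking rule are illustrative assumptions, not LLaDA's released sampler:

```python
import torch

def diffusion_decode(model, prompt_ids, resp_len, mask_id, steps=64):
    """Sketch of non-autoregressive, diffusion-style decoding.

    `model` is a hypothetical callable mapping a 1-D token sequence to
    per-position logits of shape (seq_len, vocab_size).
    """
    resp = torch.full((resp_len,), mask_id, dtype=prompt_ids.dtype,
                      device=prompt_ids.device)
    for step in range(steps):
        masked = resp == mask_id
        if not masked.any():
            break  # everything has been committed
        logits = model(torch.cat([prompt_ids, resp]))[prompt_ids.numel():]
        conf, pred = logits.softmax(-1).max(-1)  # per-position confidence/argmax
        # Linear schedule: commit roughly an equal share of masks each step.
        k = max(1, int(masked.sum()) // (steps - step))
        conf = conf.masked_fill(~masked, float("-inf"))  # ignore committed slots
        top = conf.topk(min(k, int(masked.sum()))).indices
        resp[top] = pred[top]  # unmask the most confident positions; rest stay masked
    return resp
```

Unlike an ARM's strictly left-to-right loop, every position in the response can be resolved at any step, which is what lets the model fill in a preceding line as naturally as a following one.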