A Survey on Diffusion Language Models

Paper · arXiv 2508.10875 · Published August 14, 2025
Diffusion LLM · Reasoning Methods · CoT · ToT · Novel Architectures

Abstract—Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. Recent advances have allowed DLMs to match the performance of their autoregressive counterparts while achieving a several-fold speed-up, making them a compelling choice for various natural language processing tasks. Despite their growing prevalence, DLMs present challenges and opportunities that warrant further exploration, requiring a detailed and systematic understanding of their principles, techniques, and limitations. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field.

Discrete DLMs, on the other hand, define the diffusion process directly in token space. Early efforts such as D3PM [24] introduced structured transition matrices with absorbing states, allowing token-level corruption and iterative denoising. Subsequent work such as DiffusionBERT [25] integrated pre-trained masked language models (e.g., BERT) to enhance denoising quality, and proposed tailored noise schedules (e.g., the spindle schedule) to better align token corruption with token frequency. These early models demonstrated the feasibility of applying iterative denoising to non-autoregressive text generation, offering controllability and parallelism, though their performance still lagged behind strong autoregressive baselines. As core challenges in DLMs are gradually addressed and the paradigm matures, larger-scale DLMs have been developed. By initializing from autoregressive models, 7B-level models like Dream [26] and DiffuLLaMA [27] have shown that DLMs can be effectively adapted from existing models while achieving competitive performance. LLaDA-8B [28] further demonstrates the potential of training DLMs from scratch, achieving performance comparable to the similarly sized LLaMA3-8B. Multimodal DLMs, also known as diffusion multimodal large language models (dMLLMs), have also shown promise in modeling hybrid data such as text and images, and are typically built upon open-source DLMs.

Compared to autoregressive models, diffusion language models are widely believed to offer several distinct advantages:

• Parallel Generation: DLMs can generate multiple tokens in parallel through an iterative denoising process, significantly improving inference speed and throughput over autoregressive models.

• Bidirectional Context: DLMs naturally incorporate bidirectional context, enabling more nuanced language understanding and generation. They also produce richer contextual embeddings, which are beneficial for cross-modal generation tasks. This enables fine-grained control over the generation process as well.

• Iterative Refinement: The iterative denoising process allows DLMs to revise their predictions over multiple steps. By accepting high-confidence tokens early and keeping low-confidence regions masked, masked DLMs can progressively improve uncertain areas, often resulting in more coherent and higher-quality text generation (a minimal decoding sketch follows this list).

• Controllability: DLMs can be conditioned on specific token positions or structures, making them well-suited for tasks like infilling and structured generation. Additionally, guidance techniques (e.g., classifier-free guidance) enable better control over style and semantic relevance.

• Unified Modeling Across Modalities: By applying a shared denoising-based modeling framework, DLMs naturally support unified text and vision generation tasks. This makes them particularly promising for multimodal applications that require both generation and understanding within a single model.
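To make the parallel-generation and iterative-refinement properties concrete, the following minimal sketch shows one common decoding loop for masked DLMs: the model predicts every masked position at once, the most confident predictions are accepted, and the remaining positions stay masked for later steps. The model interface, mask-token id, and unmasking schedule below are illustrative assumptions rather than any specific system's API.

import torch

MASK_ID = 0  # hypothetical mask-token id (assumption)

@torch.no_grad()
def parallel_denoise(model, prompt_ids, gen_len=64, num_steps=8):
    """Confidence-based iterative unmasking (illustrative sketch).

    `model(ids)` is assumed to return logits of shape [seq_len, vocab_size].
    """
    ids = torch.cat([prompt_ids,
                     torch.full((gen_len,), MASK_ID, dtype=torch.long)])
    per_step = max(1, gen_len // num_steps)   # simple linear unmasking schedule

    while (ids == MASK_ID).any():
        masked = ids == MASK_ID
        logits = model(ids)                      # predict all positions in parallel
        conf, pred = logits.softmax(-1).max(-1)  # per-position confidence and argmax
        conf = conf.masked_fill(~masked, -1.0)   # only consider masked slots
        k = min(per_step, int(masked.sum()))     # accept the k most confident tokens
        top = conf.topk(k).indices
        ids[top] = pred[top]                     # commit; the rest stay masked
    return ids

Because every forward pass updates many positions at once, the number of model calls scales with num_steps rather than gen_len, which is the source of the latency advantage; accepting fewer tokens per step trades speed for quality.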

3.2 Post-training for Reasoning Capabilities

Exploration of reasoning capabilities is becoming increasingly popular in DLMs as their performance on language tasks improves. Typically, reasoning capabilities are gained through fine-tuning on reasoning datasets. For DLMs, this presents a unique and formidable challenge. Traditional Chain-of-Thought (CoT) methods rely on the sequential nature of AR models to reason step by step, but DLMs generate tokens in parallel. The most successful post-training techniques in the AR domain, particularly those based on reinforcement learning (RL) and policy gradient methods, are built upon the ability to efficiently compute the log-probability of a generated sequence. This is straightforward in AR models due to their factorizable, sequential nature. In DLMs, where generation is an iterative, non-sequential process, the log-likelihood is intractable, creating a significant technical barrier to transferring the mature suite of RL algorithms developed for AR models. We categorize these works into three main streams, which form the structure of this subsection: (1) Parallelizing the reasoning chain, where the CoT paradigm of AR models is adapted to DLMs' parallel generation. (2) Adapting policy gradient methods, where variants of popular algorithms like GRPO are introduced to DLMs. (3) Adapting preference optimization methods such as DPO to DLMs.
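The contrast can be stated compactly. For an AR model, the sequence log-likelihood factorizes exactly into per-token terms, whereas a masked DLM typically only admits a variational lower bound obtained by averaging over masking ratios. The bound below follows the standard masked-diffusion ELBO in generic notation; it is a schematic summary rather than a formula from any single cited paper.

\log p_\theta(x) = \sum_{i=1}^{L} \log p_\theta\!\left(x_i \mid x_{<i}\right) \quad \text{(AR: exact and cheap to evaluate)}

\log p_\theta(x_0) \;\ge\; -\,\mathbb{E}_{t \sim \mathcal{U}[0,1]}\,\mathbb{E}_{x_t}\!\left[ \frac{1}{t} \sum_{i=1}^{L} \mathbf{1}\!\left[x_t^{i} = \texttt{[MASK]}\right] \left(-\log p_\theta\!\left(x_0^{i} \mid x_t\right)\right) \right] \quad \text{(masked DLM: lower bound only)}

Policy-gradient objectives require the log-probability of a sampled sequence, so RL for DLMs must either work with this bound or redefine the action space, which motivates the methods below.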

3.2.1 DoT and DCoLT: Parallelizing the Reasoning Chain

One of the pioneering works to elicit complex reasoning in DLMs is Diffusion-of-Thought (DoT) [81], which adapts the popular Chain-of-Thought paradigm to the diffusion framework. Instead of generating reasoning steps sequentially like autoregressive models, DoT formulates them as intermediate thoughts that are refined in parallel throughout the diffusion denoising process. The approach is implemented by fine-tuning pre-trained DLMs such as Plaid [66] and SEDD [67] on datasets containing problems and their corresponding step-by-step rationales. To enhance the model's ability to recover from its own mistakes, DoT introduces specialized training techniques like scheduled sampling and coupled sampling, which expose the model to its own generated errors during training to improve its self-correction capabilities. This post-training methodology enables smaller DLMs to achieve impressive reasoning performance, even outperforming significantly larger autoregressive models on certain mathematical and logical reasoning benchmarks.
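The self-correction idea behind these techniques can be sketched as a small change to the standard masked-diffusion training step: with some probability, visible context tokens are replaced by the model's own earlier predictions rather than ground truth, so the model learns to denoise on top of its own mistakes. This is a simplified illustration of scheduled sampling in this setting, not the exact DoT recipe; model, mask_id, and mix_prob are assumed placeholders.

import torch
import torch.nn.functional as F

def scheduled_sampling_step(model, x0, mask_id, mix_prob=0.25):
    """One masked-diffusion training step with scheduled sampling (sketch).

    x0: [B, L] ground-truth token ids. With probability `mix_prob`, visible
    context tokens are swapped for the model's own predictions, exposing
    the model to its own errors during training.
    """
    B, L = x0.shape
    t = torch.rand(B, 1)                          # per-sample masking ratio
    masked = torch.rand(B, L) < t                 # positions to hide
    xt = torch.where(masked, torch.full_like(x0, mask_id), x0)

    # Scheduled sampling: corrupt part of the *visible* context with the
    # model's current predictions instead of ground-truth tokens.
    with torch.no_grad():
        self_pred = model(xt).argmax(-1)          # model's own guesses, [B, L]
    mix = (torch.rand(B, L) < mix_prob) & ~masked
    xt = torch.where(mix, self_pred, xt)

    logits = model(xt)                            # [B, L, vocab]
    return F.cross_entropy(logits[masked], x0[masked])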

A more recent approach, Diffusion Chain of Lateral Thought (DCoLT) [82], introduces a distinct RL-based reasoning framework inspired by the cognitive concept of lateral thinking, which contrasts with the step-by-step vertical thinking of traditional CoT methods. Instead of supervising intermediate steps, DCoLT treats each step of the reverse diffusion process as a latent thinking action, and optimizes the entire multi-step denoising trajectory with outcome-based RL to maximize a reward on the final answer. When applied to masked DLMs like LLaDA, DCoLT innovatively introduces an Unmasking Policy Module (UPM), which learns the optimal order for revealing tokens as part of the RL action space. This approach significantly boosts the reasoning capabilities of DLMs, with the DCoLT-reinforced LLaDA model achieving gains of +9.8% on GSM8K and +19.5% on HumanEval.
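The unmasking-policy idea can be sketched as follows: a small head scores each masked position, a set of positions to reveal is sampled from that distribution as a stochastic action, and the accumulated action log-probabilities are later scaled by a terminal reward in a REINFORCE-style update. The names and shapes here are illustrative assumptions; the actual DCoLT objective is more involved than this sketch.

import torch

def sample_unmask_action(position_scores, masked, k):
    """Sample k masked positions to reveal, treated as an RL action.

    position_scores: [L] logits from a policy head over sequence positions.
    Returns the chosen indices and an approximate action log-probability;
    the approximation treats the k draws as independent for simplicity.
    """
    scores = position_scores.masked_fill(~masked, float("-inf"))
    probs = scores.softmax(-1)
    idx = torch.multinomial(probs, k)    # which tokens to unmask this step
    logp = probs[idx].log().sum()        # log pi(action | state), approximate
    return idx, logp

# Over a full denoising trajectory, the per-step log-probs are summed and a
# final-answer reward R (e.g., correctness) drives a REINFORCE-style update:
#     loss = -(R - baseline) * total_logp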

7 APPLICATIONS ON DOWNSTREAM TASKS

7.1 Conventional NLP Tasks

Before the emergence of large-scale DLMs for general-purpose language generation, DLMs had already been applied to various conventional NLP tasks, including text classification [99], named entity/scene recognition [100], [101], sentiment analysis [102], document summarization [103], [104], style transfer [110], [169], constrained generation [111]–[115], and machine translation [116], [170].