LLM Reasoning and Architecture

Can diffusion language models match autoregressive inference speed?

Diffusion LLMs promised faster decoding through parallel token generation, but open-source implementations never outpaced autoregressive models in practice. What architectural barriers prevent diffusion from realizing its speed potential?

Note · 2026-05-03 · sourced from Diffusion LLM

Diffusion LLMs were initially proposed in part for inference speed: in principle they decode multiple tokens per iteration, which suggests they should outpace autoregressive models. In practice, no open-source dLLM had achieved higher inference speed than AR LLMs of similar size. The paradox is that bidirectional attention, while it enables parallel generation within a step, costs more compute per step and prevents the KV-cache reuse that makes AR inference cheap.
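To make the compute asymmetry concrete, here is a minimal back-of-the-envelope sketch (not from the source; the sequence length, model width, and denoising step count are illustrative assumptions) comparing attention work for cached AR decoding against full-sequence bidirectional denoising.

```python
# Rough attention-cost comparison: causal AR with a KV cache vs. bidirectional
# diffusion that re-attends over the whole sequence every denoising step.
# All shapes and step counts below are illustrative assumptions, not measurements.

def ar_step_cost(cached_len: int, d_model: int) -> int:
    """One AR decoding step: the new token attends over `cached_len` cached K/V
    pairs, so the attention work is O(cached_len * d_model)."""
    return cached_len * d_model

def diffusion_step_cost(seq_len: int, d_model: int) -> int:
    """One diffusion denoising step: bidirectional attention over the full
    sequence with no KV reuse, O(seq_len^2 * d_model)."""
    return seq_len * seq_len * d_model

if __name__ == "__main__":
    seq_len, d_model, denoise_steps = 1024, 4096, 64
    ar_total = sum(ar_step_cost(t, d_model) for t in range(seq_len))
    dllm_total = denoise_steps * diffusion_step_cost(seq_len, d_model)
    # Even with 16x fewer steps, each diffusion step touches the whole sequence.
    print(f"AR total attention work   ~ {ar_total:.2e}")
    print(f"dLLM total attention work ~ {dllm_total:.2e}")
```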

Discrete Diffusion Forcing (D2F) breaks this barrier with a hybrid design that combines the speed advantages of both paradigms. The first capability is block-wise autoregressive generation (tokens are produced in blocks rather than as one flat sequence), which permits KV-cache reuse across blocks just as in AR models and removes the per-step compute overhead that bidirectional attention otherwise imposes. The second capability is predicting tokens in later blocks without waiting for earlier blocks to complete, which enables inter-block parallel decoding and recovers the parallelism that pure AR cannot offer.
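A toy scheduling sketch of the two capabilities follows, assuming a made-up model interface; the Block class, denoise_step, lookahead parameter, and cache representation are hypothetical stand-ins, not D2F's actual API.

```python
# Toy sketch of block-wise hybrid decoding: blocks are refined diffusion-style,
# completed blocks feed an AR-style KV cache, and a later block may start before
# earlier blocks finish (inter-block parallelism). All names are hypothetical.

from dataclasses import dataclass

@dataclass
class Block:
    tokens: list         # partially denoised tokens; None means still masked
    steps_done: int = 0  # refinement steps applied so far

def denoise_step(block: Block, kv_cache: list) -> None:
    """One refinement step over a block, conditioned on the KV cache of preceding
    blocks (conditioning elided in this toy). Stand-in: unmask one token per step."""
    for i, tok in enumerate(block.tokens):
        if tok is None:
            block.tokens[i] = f"tok{i}"
            break
    block.steps_done += 1

def hybrid_decode(num_blocks: int, block_size: int, steps_per_block: int, lookahead: int = 1):
    """Block b+1 starts refining once block b has made `lookahead` steps of progress,
    so several blocks are denoised in parallel while finished blocks are cached."""
    blocks = [Block(tokens=[None] * block_size) for _ in range(num_blocks)]
    kv_cache: list = []  # one entry per fully decoded block, reused by later blocks
    active = [0]         # indices of blocks currently being refined
    while active:
        for b in list(active):
            denoise_step(blocks[b], kv_cache)
            nxt = b + 1
            if (blocks[b].steps_done >= lookahead and nxt < num_blocks
                    and nxt not in active and blocks[nxt].steps_done == 0):
                active.append(nxt)  # open the next block early
            if blocks[b].steps_done == steps_per_block:
                kv_cache.append(f"kv(block {b})")  # AR-style cache reuse across blocks
                active.remove(b)
    return [blk.tokens for blk in blocks]

print(hybrid_decode(num_blocks=3, block_size=4, steps_per_block=4))
```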

The implementation matters as much as the design. D2F uses an asymmetric distillation process from pre-trained dLLMs, so existing dLLMs can be converted into the AR-diffusion hybrid paradigm without training from scratch. A pipelined parallel decoding algorithm exposes a configurable trade-off between efficiency and efficacy, letting each deployment choose its own operating point.
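One plausible way such an efficiency/efficacy knob could be exposed is a confidence threshold on parallel commits; this is a guessed mechanism for illustration, not the paper's exact algorithm, and the function name, threshold, and probabilities below are all hypothetical.

```python
# Hedged sketch of a threshold-controlled commit rule: a lower threshold commits
# more tokens per step (faster, riskier); a higher threshold commits fewer and
# behaves closer to fully sequential decoding. All numbers are illustrative.

def commit_tokens(token_probs: dict, threshold: float = 0.9) -> dict:
    """Commit every still-masked position whose top token probability clears
    `threshold`; the rest stay masked for the next refinement step."""
    committed = {}
    for pos, probs in token_probs.items():
        tok, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= threshold:
            committed[pos] = tok
    return committed

# Per-position distributions from one hypothetical denoising step.
step_output = {
    0: {"the": 0.97, "a": 0.03},
    1: {"cat": 0.55, "dog": 0.45},
    2: {"sat": 0.92, "slept": 0.08},
}
print(commit_tokens(step_output, threshold=0.9))  # efficacy-leaning: positions 0 and 2
print(commit_tokens(step_output, threshold=0.5))  # efficiency-leaning: all three positions
```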

The deeper lesson is that the AR-vs-diffusion framing has been a false dichotomy at inference time. The two paradigms decompose generation along different axes: AR along sequence position, diffusion along refinement step. A hybrid that runs AR along the block sequence while running diffusion within and across blocks captures both kinds of parallelism. Architectural purity costs throughput; pragmatic hybrids win. This converges with the pattern in "How should we balance parallel versus sequential compute at test time?": mixed paradigms outperform pure ones.


Source: Diffusion LLM

Original note title: diffusion language models can achieve faster-than-autoregressive inference by hybridizing block-wise AR with inter-block parallelism