Can diffusion models commit to answers before full decoding?
Do diffusion language models settle on correct answers early in their refinement process, and if so, can we detect and exploit this convergence to speed up inference without losing quality?
Diffusion LMs are slower than AR models at inference, primarily because of the cost of bidirectional attention and the large number of refinement steps required for high-quality outputs. The standard assumption is that more refinement yields better answers, so cutting the refinement budget should cost accuracy. This paper documents a counterintuitive empirical property: early answer convergence. In many cases the correct answer is internally settled at half the refinement budget, well before the final decoding step: up to 97% of GSM8K instances and up to 99% of MMLU instances. The pattern holds under both semi-autoregressive and random remasking schedules.
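One way to see this property in a DLM you control is to log per-step logits over one full refinement run, then find the earliest step at which a greedy read-out of the answer positions already matches the full-budget answer. Below is a minimal probe in that spirit; the tensor shapes and the logits-trajectory logging are assumptions about your setup, not the paper's evaluation harness.

```python
import torch

def convergence_step(trajectory_logits, answer_span, final_tokens):
    """Return the first refinement step whose greedy read-out of the answer
    span matches the final full-budget answer. (The read-out may still flip
    afterwards; the paper's claim concerns being settled by the halfway point.)

    trajectory_logits: list of [seq_len, vocab] tensors, one per step
    answer_span:       slice over the answer token positions
    final_tokens:      [seq_len] token ids from the full-budget decode
    """
    target = final_tokens[answer_span]
    for step, logits in enumerate(trajectory_logits):
        early = logits.argmax(dim=-1)[answer_span]  # greedy read-out at this step
        if torch.equal(early, target):
            return step
    return len(trajectory_logits)  # never matched before the final step
```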
This reveals a fundamental redundancy in conventional full-length decoding. Most of the latter half of refinement is not improving the answer; it is merely maintaining an answer the model has already settled on. The right framing treats DLM decoding as a stopping problem: when is it safe to commit and emit the answer rather than continue refining?
Prophet operationalizes this insight as a training-free fast-decoding paradigm: it monitors the confidence gap between the top-2 prediction candidates and dynamically decides whether to continue refinement or "go all-in" and decode all remaining tokens in a single step. The confidence gap serves as a reliable signal for when the model has internally committed; once it has, additional refinement is wasted compute. The mechanism slots into existing DLM implementations with negligible overhead and no additional training.
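A minimal sketch of what such a confidence-gap early exit could look like, assuming a refinement loop that exposes per-position logits. The fixed threshold `tau`, the `step_fn` interface, and the min-over-masked-positions aggregation are illustrative assumptions, not Prophet's exact rule.

```python
import torch

def should_commit(logits: torch.Tensor, masked: torch.Tensor, tau: float) -> bool:
    """True when the top-1/top-2 probability gap at every still-masked
    position clears the threshold, i.e. the model looks committed."""
    probs = logits.softmax(dim=-1)        # [seq_len, vocab]
    top2 = probs.topk(2, dim=-1).values   # [seq_len, 2]
    gap = top2[:, 0] - top2[:, 1]         # per-position confidence gap
    return bool(gap[masked].min() >= tau)

def prophet_decode(step_fn, tokens, masked, max_steps, tau=0.5):
    """Wrap an existing DLM refinement loop with the early-exit check.
    step_fn(tokens, masked) -> (tokens, masked, logits) is assumed;
    tokens is [seq_len] ids, masked is [seq_len] bool."""
    for _ in range(max_steps):
        tokens, masked, logits = step_fn(tokens, masked)
        if not masked.any():              # everything already decoded
            break
        if should_commit(logits, masked, tau):
            # "Go all-in": greedily fill every remaining masked token at once.
            tokens[masked] = logits.argmax(dim=-1)[masked]
            break
    return tokens
```

Because the check only reads logits the refinement step already computes, the added cost is a softmax and a top-2 over positions that remain masked, which is negligible next to a forward pass.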
Empirically, on LLaDA-8B and Dream-7B across multiple tasks, Prophet reduces decoding steps by up to 3.4× while preserving generation quality. The structural lesson generalizes beyond DLMs: any iterative-refinement model with monitorable internal confidence faces a stopping problem rather than a fixed budget, and treating refinement steps as a hyperparameter rather than a runtime decision leaves substantial compute on the table; this is the same diagnosis that "Does reflection in reasoning models actually correct errors?" reaches for AR reasoning.
Source: Diffusion LLM
Related concepts in this collection
- Can reasoning and answers be generated separately in language models?
  Explores whether diffusion LLMs can embed reasoning prompts directly within generation sequences rather than as prefixes, and whether answers and reasoning can be decoupled as independent refinement axes.
  extends: ICE uses the same early-convergence property as a confidence-aware exit signal — but at the prompt+answer joint structure level rather than at the whole-sequence level
- Can diffusion language models match autoregressive inference speed?
  Diffusion LLMs promised faster decoding through parallel token generation, but open-source implementations never outpaced autoregressive models in practice. What architectural barriers prevent diffusion from realizing its speed potential?
  complements: two attacks on the diffusion speed gap — D2F changes the architecture; Prophet stops early without changing it
- Does reflection in reasoning models actually correct errors?
  When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
  exemplifies: same redundancy story in AR reasoning — most refinement after first answer is maintenance not improvement
- Does chain-of-thought reasoning reflect genuine thinking or performance?
  When language models generate step-by-step reasoning, are they actually thinking through problems or just producing text that looks like reasoning? This matters for understanding whether extended reasoning tokens add real computational value.
  complements: AR analogue — early commitment on easy tasks parallels Prophet's early convergence in the diffusion case
- When should an agent actually stop and deliberate?
  How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.
  complements: same conditional-deliberation principle applied to agent actions rather than to refinement steps
Original note title: diffusion language models know the answer well before decoding completes — up to 99 percent of MMLU instances are correctly resolvable at half the refinement budget