Can diffusion models perform infilling and reverse generation as naturally as forward generation?
This note explores whether diffusion models, which build text by refining all positions at once rather than left to right, can fill in gaps and run 'backwards' as easily as they generate forwards, and what their bidirectional nature buys them that autoregressive models can't get.
The short version the corpus suggests: yes, and this is arguably the whole point of them. Autoregressive models (the GPT-style left-to-right generators) commit to each token before seeing what comes next, which makes any task that isn't "continue forward" awkward. Diffusion models instead start from noise (or a fully masked sequence) and refine every position simultaneously, so there's no privileged direction. Infilling stops being a special trick and becomes the default mode of operation.
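To make "no privileged direction" concrete, here is a minimal sketch of mask-and-refine decoding. Everything in it is an assumption made for illustration: the `PAIRS` table is a hypothetical stand-in for a trained bidirectional denoiser, and `score` and `denoise` are invented names, not any model's API. The loop repeatedly fills whichever masked position currently has the largest margin between its best and second-best candidate, wherever that position sits.

```python
MASK = None
VOCAB = ["the", "cat", "sat", "on", "mat"]

# Hypothetical stand-in for a trained denoiser: which adjacent word pairs fit.
PAIRS = {("the", "cat"), ("cat", "sat"), ("sat", "on"),
         ("on", "the"), ("the", "mat")}


def score(seq, i, tok):
    """Count how many neighbours (left and right) are compatible with tok at i."""
    left_ok = i > 0 and (seq[i - 1], tok) in PAIRS
    right_ok = i + 1 < len(seq) and (tok, seq[i + 1]) in PAIRS
    return int(left_ok) + int(right_ok)


def denoise(seq):
    """Fill masked positions one at a time, most confident position first."""
    seq = list(seq)
    while MASK in seq:
        best = None  # (margin, position, token)
        for i, cur in enumerate(seq):
            if cur is not MASK:
                continue
            # Stable sort: ties keep VOCAB order, so the result is deterministic.
            ranked = sorted(VOCAB, key=lambda t: score(seq, i, t), reverse=True)
            margin = score(seq, i, ranked[0]) - score(seq, i, ranked[1])
            if best is None or margin > best[0]:
                best = (margin, i, ranked[0])
        _, pos, tok = best
        seq[pos] = tok
    return seq


# Infilling: both ends are given, the middle is masked.
infill = denoise(["the", MASK, MASK, "on", "the", "mat"])
# "Reverse" generation: only the suffix is given.
reverse = denoise([MASK, MASK, "sat", "on", "the", "mat"])
```

In both calls the position nearest the known right-hand context is resolved first, and that choice then disambiguates its left neighbour. The same function handles a masked suffix (ordinary forward generation) without any change; only the mask pattern differs.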
The clearest evidence is Diffusion-LM, which succeeds on six fine-grained control tasks, including infilling, length, syntax, and semantics, precisely where left-to-right plug-and-play methods fail (see "Can diffusion models enable control that autoregressive models cannot reach?"). Because its latent variables are continuous, gradients flow across the entire sequence at once, so a constraint at the end can reach back and reshape the beginning. That bidirectionality shows up again in "in-place prompting," where reasoning is embedded directly into masked positions and refined alongside the answer rather than having to precede it in sequence, something an autoregressive model structurally cannot do (see "Can reasoning and answers be generated separately in language models?").
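The claim that a constraint at the end can reshape the beginning can be illustrated with a toy gradient-descent sketch. This is not Diffusion-LM's procedure: the quadratic "fluency" term coupling neighbouring latents, the scalar latents, and the names `energy_grad` and `steer` are all assumptions chosen so the gradient can be written by hand. The control term touches only the last position, yet repeated gradient steps move every position.

```python
def energy_grad(x, target, lam=1.0):
    """Gradient of  sum_i (x[i+1] - x[i])**2  +  lam * (x[-1] - target)**2.

    The first term is a hypothetical stand-in for a fluency prior that couples
    neighbouring latents; the second is a control term on the final position only.
    """
    g = [0.0] * len(x)
    for i in range(len(x) - 1):
        d = x[i + 1] - x[i]
        g[i] -= 2.0 * d
        g[i + 1] += 2.0 * d
    g[-1] += 2.0 * lam * (x[-1] - target)
    return g


def steer(x, target, steps=5000, lr=0.1):
    """Plain gradient descent over the whole latent sequence at once."""
    x = list(x)
    for _ in range(steps):
        g = energy_grad(x, target)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


after_one_step = steer([0.0] * 5, target=1.0, steps=1)
converged = steer([0.0] * 5, target=1.0)
```

After a single step only the last latent has moved; after many steps the first latent has been pulled to the target as well, because each update is computed over the full sequence rather than over a committed prefix.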
The deeper reason forward isn't special for these models is architectural. Autoregressive generation can never retract a token it has already emitted, which is why it hits a hard ceiling on constraint-satisfaction problems that depend on discarding bad partial guesses (see "Why does autoregressive generation fail at constraint satisfaction?"). Diffusion's iterative refinement is closer to that retract-and-revise loop: you can think of denoising as repeatedly proposing and correcting, which one note frames as performing the same selection-and-mutation steps as evolutionary search (see "Can diffusion models perform evolutionary search in parameter space?"). "Reverse" generation isn't a separate capability bolted on; it falls out of having no committed direction in the first place.
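The commit-versus-retract distinction can be seen in a classical toy, 4-queens, sketched below. This is a textbook backtracking example rather than a model of either architecture; `commit_only` and `with_retraction` are hypothetical names. The first function places one queen per row and never revisits a choice, the second is allowed to discard a partial assignment.

```python
def safe(cols, col):
    """True if a queen in the next row at `col` attacks none of those in `cols`."""
    row = len(cols)
    return all(c != col and abs(c - col) != row - r for r, c in enumerate(cols))


def commit_only(n):
    """Take the first safe column in each row and never revisit a choice."""
    cols = []
    for _ in range(n):
        choice = next((c for c in range(n) if safe(cols, c)), None)
        if choice is None:
            return None  # stuck: earlier commitments cannot be undone
        cols.append(choice)
    return cols


def with_retraction(n, cols=()):
    """Depth-first search that discards partial assignments that dead-end."""
    if len(cols) == n:
        return list(cols)
    for c in range(n):
        if safe(cols, c):
            found = with_retraction(n, cols + (c,))
            if found is not None:
                return found
    return None  # retract: the caller tries its next option


greedy = commit_only(4)
searched = with_retraction(4)
```

On the 4x4 board the commit-only pass dead-ends in the third row, while the version that can retract finds a valid placement.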
What's surprising is that this flexibility doesn't cost you scale or even speed, which is where most people expect the catch. LLaDA shows non-autoregressive diffusion matching autoregressive scaling, implying that left-to-right factorization is a contingent choice rather than the source of LLMs' power (see "Does autoregressive generation uniquely enable LLM scaling?"). And diffusion models converge on the correct answer remarkably early: up to 99% of MMLU and 97% of GSM8K instances are right by the halfway point of refinement, which Prophet exploits for a 3.4× early-exit speedup with no quality loss (see "Can diffusion models commit to answers before full decoding?"). Hybrid schemes even recover autoregressive-level inference efficiency while keeping the parallelism (see "Can diffusion language models match autoregressive inference speed?").
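The early-exit idea can be sketched as a confidence-gap check. The notes say only that Prophet monitors confidence gaps, so the specific rule here (stop once the top-1 minus top-2 probability reaches a fixed threshold), the threshold value, and the per-step probabilities in `trace` are all assumptions for illustration, not the published criterion or real data.

```python
def confidence_gap(probs):
    """Difference between the two highest candidate probabilities."""
    top = sorted(probs.values(), reverse=True)
    return top[0] - top[1]


def early_exit(trace, threshold=0.5):
    """Return (step, answer) at the first refinement step whose gap clears
    the threshold, or at the final step if none does."""
    if not trace:
        raise ValueError("trace must contain at least one refinement step")
    for step, probs in enumerate(trace, start=1):
        if confidence_gap(probs) >= threshold:
            break
    return step, max(probs, key=probs.get)


# Made-up answer distributions over four refinement steps.
trace = [
    {"A": 0.40, "B": 0.35, "C": 0.25},
    {"A": 0.30, "B": 0.50, "C": 0.20},
    {"A": 0.15, "B": 0.75, "C": 0.10},
    {"A": 0.05, "B": 0.90, "C": 0.05},
]
result = early_exit(trace)
```

With these made-up numbers the gap first clears 0.5 at the third of four steps, so the last refinement step is skipped. A real threshold would have to be tuned against accuracy on held-out data.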
The honest caveat, and the thing you didn't know you wanted to know: that same any-direction freedom is exactly what makes diffusion models hard to train with reinforcement learning. Because tokens emerge in parallel rather than in a fixed order, the clean log-likelihood factorization that RL and preference-optimization methods like GRPO and DPO rely on falls apart, forcing researchers into workarounds like outcome-based rewards and learned unmasking orders (see "Why can't we easily adapt reinforcement learning to diffusion language models?"). So infilling and reverse generation come naturally, but the absence of a canonical order that grants them also breaks tooling that quietly assumed one.
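The factorization problem can be stated in two lines. The notation below is an assumption introduced here (sequence length $T$, denoising steps $K$, clean sequence $x_0$, fully noised or masked state $x_K$); it follows the standard textbook forms rather than any one paper's notation.

```latex
% Autoregressive: the sequence log-likelihood is an exact sum of per-token terms.
\log p_\theta(x) = \sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)

% Diffusion: the likelihood of the clean sequence marginalizes over every
% denoising trajectory x_1, \dots, x_K that could have produced it.
p_\theta(x_0) = \sum_{x_{1:K}} p(x_K) \prod_{k=1}^{K} p_\theta\left(x_{k-1} \mid x_k\right)
```

Methods like GRPO and DPO need $\log p_\theta$ of a sampled output, which the first form supplies in a single pass. The second form requires a sum over trajectories that is intractable to evaluate exactly, which is why the workarounds lean on outcome-level rewards or on learning the unmasking order itself.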
Sources (8 notes)
Diffusion-LM succeeds on six fine-grained control tasks (syntax, semantics, infilling, length) where plug-and-play methods fail. Its continuous latent variables allow gradients to flow across the entire sequence simultaneously, replacing the discrete-token bottleneck and enabling parallel denoising.
ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.
The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.
Denoising in diffusion models performs selection, mutation, and reproductive isolation—the core mechanisms of evolution. Diffusion Evolution empirically outperforms mainstream evolutionary algorithms by preserving multimodality where traditional methods collapse to single solutions.
LLaDA demonstrates that non-autoregressive diffusion models match autoregressive scaling performance. This suggests scalability emerges from the interplay of architecture, dataset size, and Fisher-consistent principles—meaning autoregressive factorization is contingent rather than necessary.
Up to 99% of MMLU instances and 97% of GSM8K instances reach correct answers by the midpoint of refinement. Prophet exploits this by monitoring confidence gaps to stop early, achieving 3.4× speedup with no quality loss.
Discrete Diffusion Forcing breaks the speed barrier through block-wise autoregressive generation with KV cache reuse and inter-block parallel decoding. This hybrid approach recovers both the compute efficiency of AR and the parallelism advantage of diffusion.
Diffusion language models cannot directly use AR-developed RL methods like GRPO and DPO because iterative non-sequential token generation requires marginalizing over denoising trajectories, making likelihood intractable. Workarounds exist—outcome-based rewards, policy learning for unmasking order, and adapted preference optimization—enabling models like DCoLT to gain 9–19% on benchmarks.