Why do diffusion models fail at inherently sequential problems?

This explores why a model that generates a whole sequence in parallel — refining all tokens at once through denoising — struggles with problems where each step genuinely depends on the result of the step before it.

This reads the question as a clash between two ways of producing an answer: diffusion models build text by refining the entire sequence in parallel, while some problems can only be solved by working through steps one at a time. The corpus suggests the failure isn't a bug in any particular diffusion model — it's baked into what makes diffusion fast in the first place.

The clearest statement of the cost comes from work showing that sequential chain-of-thought achieves *exponentially* higher accuracy than parallel voting on genuinely compositional tasks like tracing connectivity through a graph (When does sequential reasoning beat parallel voting?). The reason is concrete: the solution requires accumulating intermediate results in order, and no amount of guessing the whole answer at once can substitute for actually carrying the chain forward. Parallel sampling explores breadth; it can't manufacture a dependency that has to be computed in sequence. That's the shape of the problem diffusion runs into.
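The dependency can be made concrete with a toy, sketched here under loud assumptions: the graph is a random successor map, and the "parallel" side is modeled as voters that may take only a few hops before guessing the endpoint. The names `make_chain_graph`, `follow`, and `short_chain_vote` are hypothetical, and the sketch does not reproduce the cited result or its exponential gap. It only shows that hop k cannot be computed without hop k-1.

```python
import random
from collections import Counter


def make_chain_graph(n, seed=0):
    """Toy graph (an assumption): succ[i] is the single node reachable from i."""
    rng = random.Random(seed)
    succ = list(range(n))
    rng.shuffle(succ)
    return succ


def follow(succ, start, hops):
    """Sequential solver: each hop consumes the result of the previous hop."""
    node = start
    for _ in range(hops):
        node = succ[node]
    return node


def short_chain_vote(succ, start, hops, budget, voters, rng):
    """Parallel voting with short chains (a deliberately crude model).

    Each voter may take at most `budget` hops. If that is fewer than `hops`,
    the voter has no way to finish the chain and guesses uniformly at random.
    The majority guess is returned.
    """
    votes = Counter()
    for _ in range(voters):
        partial = follow(succ, start, min(budget, hops))
        guess = partial if budget >= hops else rng.randrange(len(succ))
        votes[guess] += 1
    return votes.most_common(1)[0][0]


def vote_accuracy(succ, hops, budget, voters, trials, seed=0):
    """Fraction of trials in which the majority vote equals the true endpoint."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        start = rng.randrange(len(succ))
        truth = follow(succ, start, hops)
        hits += short_chain_vote(succ, start, hops, budget, voters, rng) == truth
    return hits / trials
```

With `budget` equal to `hops`, the vote always matches `follow`. With a smaller budget, adding voters does not help in this toy, because every voter is missing the same intermediate results.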

What's striking is that the very mechanism behind this weakness is also diffusion's headline strength. Because a model like Diffusion-LM uses continuous latent variables, gradients can flow across the entire sequence simultaneously, which enables global control (length, syntax, infilling) that autoregressive models can't easily reach (Can diffusion models enable control that autoregressive models cannot reach?). The same parallel, non-sequential generation, though, is exactly what makes reinforcement learning hard to graft on: there's no clean left-to-right factorization of probability, so the likelihood becomes intractable and you have to marginalize over all the denoising paths (Why can't we easily adapt reinforcement learning to diffusion language models?). Parallelism and sequential reasoning trade against each other rather than lining up.
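The contrast in factorization can be written out. This is a standard textbook form, not the notation of any cited note, and the symbols are assumptions introduced here: x_0 is the clean sequence, x_T the fully noised or masked one, T the number of denoising steps, and \theta the model parameters.

```latex
% Autoregressive: the log-likelihood is a sum of n tractable terms.
\log p_\theta(x) = \sum_{i=1}^{n} \log p_\theta(x_i \mid x_{<i})

% Diffusion: the likelihood of the clean sequence x_0 marginalizes over
% every intermediate trajectory x_1, ..., x_T
% (the sum becomes an integral when the latents are continuous).
p_\theta(x_0) = \sum_{x_{1:T}} p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)
```

The first expression can be evaluated exactly for any sampled output, which is what likelihood-based RL objectives rely on. The second has no such closed form, since the number of trajectories grows with both sequence length and step count.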

There's a subtler wrinkle worth knowing. Diffusion models tend to *commit early*: up to 99% of MMLU instances and 97% of GSM8K instances reach the correct answer by the midpoint of refinement (Can diffusion models commit to answers before full decoding?). For pattern-recall tasks that's a free speedup. But early commitment is precisely the failure mode that sinks sequential problems elsewhere: language models in multi-turn conversation collapse when they lock onto a premature assumption before the full problem is revealed, and they can't recover from it (Why do language models fail in gradually revealed conversations?). A sequential problem is one where information arrives, or has to be derived, in order, and a model that fixes its guess too soon forecloses the later steps that would correct it.
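Early commitment can be illustrated with a confidence-gap stopping rule. The sketch below is an assumption about what such a rule might look like, not Prophet's actual implementation: the function names, the "every position must clear the threshold" criterion, and the demo probabilities are all hypothetical.

```python
def confidence_gap(probs):
    """Gap between the two most likely tokens at one position."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def first_commit_step(steps, threshold):
    """Index of the first refinement step where every position is confident.

    `steps` is a list of refinement steps; each step is a list of per-position
    probability vectors (made-up stand-ins for a model's outputs). Returns
    None if no step clears the threshold.
    """
    for i, step in enumerate(steps):
        if all(confidence_gap(p) >= threshold for p in step):
            return i
    return None


# Three refinement steps over two token positions (illustrative numbers only).
demo_steps = [
    [[0.40, 0.35, 0.25], [0.50, 0.30, 0.20]],
    [[0.70, 0.20, 0.10], [0.60, 0.30, 0.10]],
    [[0.90, 0.05, 0.05], [0.95, 0.03, 0.02]],
]

print(first_commit_step(demo_steps, threshold=0.25))  # -> 1
```

Stopping at step 1 saves the remaining computation when the answer is already settled. It also means that nothing derived at step 2 can revise the committed tokens, which is the risk on problems where later steps depend on earlier ones.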

So the deeper answer is that "sequential" names two things diffusion gives up at once: the step-by-step *computation* that compositional problems require, and the step-by-step *revision* that lets a reasoner change its mind as it goes. Interestingly, the corpus also points to the reverse repair job: researchers bolting sequential structure back onto parallel systems, such as learning the order in which to unmask tokens (Why can't we easily adapt reinforcement learning to diffusion language models?). That is a quiet admission that for some problems, order was never optional.

Sources (5 notes)

When does sequential reasoning beat parallel voting?

On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.

Can diffusion models enable control that autoregressive models cannot reach?

Diffusion-LM succeeds on six fine-grained control tasks (syntax, semantics, infilling, length) where plug-and-play methods fail. Its continuous latent variables allow gradients to flow across the entire sequence simultaneously, replacing the discrete-token bottleneck and enabling parallel denoising.

Why can't we easily adapt reinforcement learning to diffusion language models?

Diffusion language models cannot directly use AR-developed RL methods like GRPO and DPO because iterative non-sequential token generation requires marginalizing over denoising trajectories, making likelihood intractable. Workarounds exist—outcome-based rewards, policy learning for unmasking order, and adapted preference optimization—enabling models like DCoLT to gain 9–19% on benchmarks.

Can diffusion models commit to answers before full decoding?

Up to 99% of MMLU instances and 97% of GSM8K instances reach correct answers by the midpoint of refinement. Prophet exploits this by monitoring confidence gaps to stop early, achieving 3.4× speedup with no quality loss.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.
