
Why can't we easily adapt reinforcement learning to diffusion language models?

Autoregressive (AR) models enable efficient RL post-training through factorizable log-probabilities, but diffusion models generate tokens in parallel and in no fixed left-to-right order. What makes likelihood computation intractable for diffusion models, and can we work around it?

Note · 2026-05-03 · sourced from Diffusion LLM

The maturation of post-training techniques in autoregressive LLMs — RLHF, RLAIF, GRPO, DPO — has been a major source of capability gain. These methods all rely on the ability to efficiently compute the log-probability of a generated sequence, which is straightforward in AR models because the joint probability factorizes along sequence position. Each token's probability is conditioned on the prior tokens, so the sequence log-probability is just a sum of token log-probabilities computed in a single forward pass.
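As a concrete illustration, here is a minimal PyTorch-style sketch of that one-pass computation. The `model` callable and tensor shapes are assumptions for the example, not a specific library's API:

```python
# Minimal sketch: sequence log-prob under an AR model in one forward pass.
# `model` is any causal LM that returns next-token logits; names and shapes
# here are illustrative assumptions, not tied to a particular framework.
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids: torch.Tensor) -> torch.Tensor:
    """log p(x) = sum_t log p(x_t | x_<t), from a single forward pass."""
    logits = model(input_ids)                           # (batch, seq_len, vocab)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
    targets = input_ids[:, 1:].unsqueeze(-1)            # shift: labels are next tokens
    token_logps = log_probs.gather(-1, targets).squeeze(-1)  # (batch, seq_len-1)
    return token_logps.sum(dim=-1)                      # sum over positions = joint log-prob
```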

In diffusion language models (DLMs), this factorization breaks. Generation is iterative and non-sequential: tokens are denoised in parallel, with masked positions revealed across multiple steps in arbitrary order. The log-likelihood of a final sequence is no longer a simple sum but a marginalization over the trajectory of denoising steps, which is intractable to compute exactly. This creates a significant technical barrier to applying the mature suite of RL algorithms developed for AR models to DLMs. The constraint travels with the AR factorization rather than with reasoning itself, which is why "Does autoregressive generation uniquely enable LLM scaling?" reframes which part of the AR coupling is actually contingent.
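In symbols (the notation here is an assumption, written in a discrete masked-diffusion view with latent denoising states z_T, ..., z_1):

```latex
% AR: the joint factorizes, so the log-likelihood is a cheap sum
\log p_\theta(x) = \sum_{t=1}^{n} \log p_\theta(x_t \mid x_{<t})

% Diffusion: the likelihood marginalizes over all denoising trajectories
% z_T \to \cdots \to z_1 \to x, which cannot be summed exactly
\log p_\theta(x) = \log \sum_{z_{1:T}} p(z_T)\,
    \prod_{t=1}^{T-1} p_\theta(z_t \mid z_{t+1})\;
    p_\theta(x \mid z_1)
```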

The literature has converged on three streams of workaround. First, parallelizing the reasoning chain: Diffusion-of-Thought (DoT) reformulates CoT for parallel diffusion by treating reasoning steps as intermediate thoughts refined throughout the denoising process, with scheduled and coupled sampling for self-correction. Second, adapting policy gradient methods: GRPO variants for DLMs typically score outcome rewards on the final answer rather than relying on per-step likelihoods (the general pattern is sketched below). Third, adapting preference optimization: DPO variants for DLMs sidestep the intractable likelihood, typically by substituting tractable approximations for the exact log-probabilities.
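A hedged sketch of the second stream's general pattern, assuming group sampling and per-step log-probs recorded during denoising. All names are illustrative; published GRPO-for-DLM variants differ in their exact surrogates:

```python
# Sketch of the GRPO-style workaround: score whole completions with an
# outcome reward, normalize within the sampled group, and weight a *surrogate*
# for the intractable sequence log-prob (here, the log-probs of tokens
# actually revealed at each denoising step). Names are assumptions.
import torch

def grpo_style_loss(step_logps: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """
    step_logps: (group, steps) summed log-probs of tokens revealed at each
                denoising step, for each sampled completion in the group.
    rewards:    (group,) outcome rewards computed on the final answers only.
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # group-relative baseline
    traj_logps = step_logps.sum(dim=-1)          # trajectory surrogate, not exact log p(x)
    return -(advantages.detach() * traj_logps).mean()  # REINFORCE-style objective
```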

DCoLT (Diffusion Chain of Lateral Thought) is illustrative of what becomes possible once these adaptations exist. Treating each reverse diffusion step as a latent thinking action and optimizing the entire denoising trajectory with outcome-based RL produces +9.8% on GSM8K and +19.5% on HumanEval over base LLaDA, partly through a learned Unmasking Policy Module that selects the order in which tokens are revealed. The deeper point: DLMs do not lack reasoning capability; they lacked compatible post-training tools. And the cognitive style they unlock (lateral, parallel thinking rather than sequential vertical thinking) may be qualitatively different from AR-trained reasoning.
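To convey the role such a module plays, here is an illustrative-only sketch of a learned reveal-order policy. DCoLT's actual architecture is not specified here, so every name below is an assumption; this mirrors the shape of the idea, not the paper's design:

```python
# Illustrative sketch: given denoiser hidden states, score the still-masked
# positions and sample which ones to reveal at this step. Treating this
# sampling as an RL action is what makes reveal order learnable.
import torch

def sample_positions_to_unmask(scorer, hidden: torch.Tensor,
                               mask: torch.Tensor, k: int) -> torch.Tensor:
    """
    scorer: small head mapping hidden states -> per-position logits (assumed).
    hidden: (seq_len, d_model) denoiser features; mask: (seq_len,) bool, True = masked.
    Returns k masked positions sampled without replacement from the policy.
    """
    logits = scorer(hidden).squeeze(-1)                  # (seq_len,)
    logits = logits.masked_fill(~mask, float("-inf"))    # only masked slots are actions
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=k)       # sampled reveal order
```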


Source: Diffusion LLM


Applying RL to diffusion language models is hard because parallel, non-sequential generation makes log-likelihood intractable: the technical barrier that blocks adapting GRPO and DPO.