
Why can't we easily adapt reinforcement learning to diffusion language models?

Autoregressive (AR) models enable efficient RL post-training through factorizable log-probabilities, but diffusion models generate tokens in parallel and in no fixed left-to-right order. What makes likelihood computation intractable for diffusion models, and can we work around it?

Note · 2026-05-03 · sourced from Diffusion LLM

The maturation of post-training techniques in autoregressive LLMs — RLHF, RLAIF, GRPO, DPO — has been a major source of capability gain. These methods all rely on the ability to efficiently compute the log-probability of a generated sequence, which is straightforward in AR models because the joint probability factorizes along sequence position. Each token's probability is conditioned on the prior tokens, so the sequence log-probability is just a sum of token log-probabilities computed in a single forward pass.
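As a concrete illustration, here is a minimal PyTorch-style sketch of that one-pass computation. The `model` callable and tensor shapes are assumptions for the example, not a specific library's API:

```python
# Minimal sketch: sequence log-prob under an AR model in one forward pass.
# `model` is any causal LM that returns next-token logits; names and shapes
# here are illustrative assumptions, not tied to a particular framework.
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids: torch.Tensor) -> torch.Tensor:
    """log p(x) = sum_t log p(x_t | x_<t), from a single forward pass."""
    logits = model(input_ids)                           # (batch, seq_len, vocab)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
    targets = input_ids[:, 1:].unsqueeze(-1)            # shift: labels are next tokens
    token_logps = log_probs.gather(-1, targets).squeeze(-1)  # (batch, seq_len-1)
    return token_logps.sum(dim=-1)                      # sum over positions = joint log-prob
```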

In diffusion language models (DLMs), this factorization breaks. Generation is iterative and non-sequential: tokens are denoised in parallel, with masked positions revealed across multiple steps in arbitrary order. The log-likelihood of a final sequence is no longer a simple sum but a marginalization over the trajectory of denoising steps, which is intractable to compute exactly. This creates a significant technical barrier to applying the mature suite of RL algorithms developed for AR models to DLMs. The constraint travels with the AR factorization rather than with reasoning itself, which is why "Does autoregressive generation uniquely enable LLM scaling?" reframes which part of the AR coupling is actually contingent.
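In symbols (the notation here is an assumption, written in a discrete masked-diffusion view with latent denoising states z_T, ..., z_1):

```latex
% AR: the joint factorizes, so the log-likelihood is a cheap sum
\log p_\theta(x) = \sum_{t=1}^{n} \log p_\theta(x_t \mid x_{<t})

% Diffusion: the likelihood marginalizes over all denoising trajectories
% z_T \to \cdots \to z_1 \to x, which cannot be summed exactly
\log p_\theta(x) = \log \sum_{z_{1:T}} p(z_T)\,
    \prod_{t=1}^{T-1} p_\theta(z_t \mid z_{t+1})\;
    p_\theta(x \mid z_1)
```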

The literature has converged on three streams of workaround. First, parallelizing the reasoning chain: Diffusion-of-Thought (DoT) reformulates CoT for parallel diffusion by treating reasoning steps as intermediate thoughts refined throughout the denoising process, with scheduled and coupled sampling for self-correction. Second, adapting policy gradient methods: GRPO variants for DLMs typically score outcome rewards on the final answer rather than relying on per-step likelihoods (the general pattern is sketched below). Third, adapting preference optimization: DPO variants for DLMs sidestep the intractable likelihood, typically by substituting tractable approximations for the exact log-probabilities.
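A hedged sketch of the second stream's general pattern, assuming group sampling and per-step log-probs recorded during denoising. All names are illustrative; published GRPO-for-DLM variants differ in their exact surrogates:

```python
# Sketch of the GRPO-style workaround: score whole completions with an
# outcome reward, normalize within the sampled group, and weight a *surrogate*
# for the intractable sequence log-prob (here, the log-probs of tokens
# actually revealed at each denoising step). Names are assumptions.
import torch

def grpo_style_loss(step_logps: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """
    step_logps: (group, steps) summed log-probs of tokens revealed at each
                denoising step, for each sampled completion in the group.
    rewards:    (group,) outcome rewards computed on the final answers only.
    """
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # group-relative baseline
    traj_logps = step_logps.sum(dim=-1)          # trajectory surrogate, not exact log p(x)
    return -(advantages.detach() * traj_logps).mean()  # REINFORCE-style objective
```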

DCoLT (Diffusion Chain of Lateral Thought) is illustrative of what becomes possible once these adaptations exist. Treating each reverse diffusion step as a latent thinking action and optimizing the entire denoising trajectory with outcome-based RL produces +9.8% on GSM8K and +19.5% on HumanEval over base LLaDA, partly through a learned Unmasking Policy Module that selects the order in which tokens are revealed. The deeper point: DLMs do not lack reasoning capability; they lacked compatible post-training tools. And the cognitive style they unlock (lateral, parallel thinking rather than sequential vertical thinking) may be qualitatively different from AR-trained reasoning.
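To convey the role such a module plays, here is an illustrative-only sketch of a learned reveal-order policy. DCoLT's actual architecture is not specified here, so every name below is an assumption; this mirrors the shape of the idea, not the paper's design:

```python
# Illustrative sketch: given denoiser hidden states, score the still-masked
# positions and sample which ones to reveal at this step. Treating this
# sampling as an RL action is what makes reveal order learnable.
import torch

def sample_positions_to_unmask(scorer, hidden: torch.Tensor,
                               mask: torch.Tensor, k: int) -> torch.Tensor:
    """
    scorer: small head mapping hidden states -> per-position logits (assumed).
    hidden: (seq_len, d_model) denoiser features; mask: (seq_len,) bool, True = masked.
    Returns k masked positions sampled without replacement from the policy.
    """
    logits = scorer(hidden).squeeze(-1)                  # (seq_len,)
    logits = logits.masked_fill(~mask, float("-inf"))    # only masked slots are actions
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=k)       # sampled reveal order
```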


Source: Diffusion LLM


Applying RL to diffusion language models is hard because parallel, non-sequential generation makes log-likelihood intractable: the technical barrier that blocks adapting GRPO and DPO.