LLM Reasoning and Architecture

Can reasoning and answers be generated separately in language models?

Explores whether diffusion LLMs can embed reasoning prompts directly within generation sequences rather than as prefixes, and whether answers and reasoning can be decoupled as independent refinement axes.

Note · 2026-05-03 · sourced from Diffusion LLM

Autoregressive models can only attach prompts as a prefix: generation proceeds left-to-right, so reasoning must be fully generated before any answer token exists, and a CoT prompt has to live entirely ahead of the output. Diffusion LLMs, with bidirectional attention and iterative refinement, structurally permit a different prompting strategy: in-place prompts, embedded directly within masked token positions and refined alongside the rest of the sequence.
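A toy sketch of the two prompt placements (all names and the layout are illustrative, not the paper's actual implementation):

```python
MASK = "<mask>"

def ar_layout(question: str, cot_prompt: str) -> list[str]:
    # Autoregressive: the prompt can only precede generation; reasoning
    # and answer tokens are appended strictly left-to-right afterwards.
    return [question, cot_prompt]

def dllm_canvas(question: str, step_templates: list[str], n_answer: int) -> list[str]:
    # Diffusion: the whole sequence exists as a canvas from step 0.
    # Step templates sit in place among masked reasoning slots, and the
    # masked answer region is visible to attention throughout refinement.
    canvas = [question]
    for template in step_templates:
        canvas += [template, MASK]   # in-place prompt + reasoning slot
    canvas += [MASK] * n_answer      # answer region, present from the start
    return canvas
```

The point of the contrast: in the AR layout the answer region does not exist until decoding reaches it, while in the diffusion canvas it is addressable from the first denoising step.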

ICE (In-Place Chain-of-Thought Prompting with Early Exit) operationalizes this by structuring the generation sequence into two semantically distinct sections — a thinking section and an answer section — with explicit step-by-step reasoning templates embedded in the thinking section as in-place prompts. Both sections are refined simultaneously through the diffusion denoising process, so the model can refine reasoning steps while maintaining awareness of answer regions throughout generation. This is impossible in AR models, where answer content is inaccessible until reasoning completes.
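A minimal simulation of simultaneous refinement on one shared canvas; the scorer and the fill rule are hypothetical stand-ins for a real denoiser:

```python
MASK = None

def joint_refine(canvas, scorer, steps):
    """Refine thinking and answer sections together: at each step, every
    masked position is re-scored against the ENTIRE current canvas
    (bidirectional context), and the most confident one is committed."""
    for _ in range(steps):
        masked = [i for i, tok in enumerate(canvas) if tok is MASK]
        if not masked:
            break
        best = max(masked, key=lambda i: scorer(canvas, i))
        canvas[best] = f"tok{best}"   # stand-in for the model's prediction
    return canvas

# A toy scorer that happens to be most confident about later (answer-region)
# positions -- showing that fill order need not follow text order.
canvas = joint_refine([MASK] * 4, scorer=lambda c, i: i, steps=2)
```

Under this toy scorer, the two rightmost (answer-region) positions resolve first while the reasoning region is still masked, which an AR decoder cannot do.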

The second contribution exploits a previously unnamed property of dLLM refinement dynamics: confidence in answer tokens converges rapidly to high levels and stays stable, while the reasoning section continues to undergo refinement long after. This means models often determine the correct answer significantly earlier than the explicit reasoning trace stabilizes — a kind of intuitive answer commitment followed by post-hoc reasoning, mirroring the structure of human dual-process cognition (and aligning with Does chain-of-thought reasoning reflect genuine thinking or performance? from the AR side). ICE uses a confidence-aware early-exit mechanism to cut compute by parallel-decoding answer tokens once their confidence has converged, even while reasoning is still being refined.
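A sketch of the early-exit dynamic under stated assumptions: confidence grows linearly per step, and the growth rates and threshold are invented for illustration, not taken from the paper.

```python
def steps_until_exit(answer_growth, reason_growth, threshold=0.9,
                     max_steps=200, early_exit=True):
    """Count refinement steps until decoding can stop. With early_exit,
    stop once every ANSWER position clears the confidence threshold and
    parallel-decode the answer; otherwise wait for the whole sequence."""
    ans = [0.0] * len(answer_growth)
    rsn = [0.0] * len(reason_growth)
    for step in range(1, max_steps + 1):
        ans = [min(1.0, c + g) for c, g in zip(ans, answer_growth)]
        rsn = [min(1.0, c + g) for c, g in zip(rsn, reason_growth)]
        watched = ans if early_exit else ans + rsn
        if min(watched) >= threshold:
            return step
    return max_steps

# Answer confidence converges fast; reasoning keeps refining much longer.
early = steps_until_exit([0.30, 0.25], [0.05, 0.04])                    # -> 4
full  = steps_until_exit([0.30, 0.25], [0.05, 0.04], early_exit=False)  # -> 23
```

The gap between the two step counts is the compute the early-exit mechanism saves: the answer is committed long before the visible reasoning trace stabilizes.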

The structural implication is that in dLLMs, reasoning and answering are decouplable axes of generation rather than a temporally ordered sequence. The reasoning trace can serve roles other than producing the answer — for example, post-hoc justification or interpretability — and the answer can be produced from internal state earlier than the visible reasoning suggests. This breaks the AR-era assumption that visible CoT length is an upper bound on compute spent before answering.

