Can reasoning and answers be generated separately in language models?
Explores whether diffusion LLMs can embed reasoning prompts directly within generation sequences rather than as prefixes, and whether answers and reasoning can be decoupled as independent refinement axes.
Autoregressive models can only attach prompts as a prefix: because generation proceeds strictly left-to-right, reasoning must be decoded before any answer token exists, so a CoT prompt has to live entirely in the prefix and the answer only becomes available at the end of decoding. Diffusion LLMs, with bidirectional attention and iterative refinement, structurally permit a different prompting strategy: in-place prompts, embedded directly within masked token positions and refined alongside the rest of the sequence.
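To make the contrast concrete, here is a schematic of the two prompt placements, assuming a LLaDA-style masked diffusion LM; the token strings and the `[MASK]` placeholder are illustrative, not any specific tokenizer's vocabulary.

```python
M = "[MASK]"

# Autoregressive: the CoT prompt can only be a prefix. Reasoning is decoded
# left-to-right after it, and the answer exists only once decoding finishes.
ar_prompt = ["Let's", "think", "step", "by", "step", ":"]  # + reasoning, then answer

# Diffusion: the prompt is embedded inside the masked canvas. Fixed template
# tokens sit among mask tokens, and the whole canvas is refined in parallel,
# so answer positions are visible to the model from the first denoising step.
dllm_canvas = (
    ["Step", "1", ":"] + [M] * 8 +
    ["Step", "2", ":"] + [M] * 8 +
    ["Answer", ":"] + [M] * 4
)
```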
ICE (In-Place Chain-of-Thought Prompting with Early Exit) operationalizes this by structuring the generation sequence into two semantically distinct sections — a thinking section and an answer section — with explicit step-by-step reasoning templates embedded in the thinking section as in-place prompts. Both sections are refined simultaneously through the diffusion denoising process, so the model can refine reasoning steps while maintaining awareness of answer regions throughout generation. This is impossible in AR models, where answer content is inaccessible until reasoning completes.
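A minimal sketch of that joint refinement, assuming a mask-predicting denoiser with bidirectional attention; `denoiser`, `MASK_ID`, and the confidence-top-k commit rule are illustrative assumptions, not ICE's published schedule.

```python
import torch

MASK_ID = 0  # hypothetical mask-token id, for illustration

def ice_refine(denoiser, canvas, num_steps=32, k_per_step=4):
    # canvas: (L,) LongTensor. Masked positions hold MASK_ID; the in-place
    # prompt template occupies ordinary unmasked positions, so it stays fixed.
    # denoiser: (1, L) ids -> (1, L, V) logits with bidirectional attention,
    # meaning answer positions condition on thinking positions at every step.
    for _ in range(num_steps):
        logits = denoiser(canvas.unsqueeze(0))[0]       # (L, V)
        conf, pred = logits.softmax(-1).max(-1)         # per-position confidence
        masked = canvas == MASK_ID                      # still-undecided positions
        if not masked.any():
            break
        # Commit the k most confident masked positions, wherever they fall:
        # answer tokens may lock in while the thinking section keeps refining.
        conf = conf.masked_fill(~masked, float("-inf"))
        top = conf.topk(min(k_per_step, int(masked.sum()))).indices
        canvas[top] = pred[top]
    return canvas
```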
The second contribution exploits a previously unnamed property of dLLM refinement dynamics: confidence in answer tokens converges rapidly to high levels and then stays stable, while the reasoning section continues to undergo refinement long after. Models therefore often determine the correct answer well before the explicit reasoning trace stabilizes, a kind of intuitive answer commitment followed by post-hoc reasoning that mirrors the structure of human dual-process cognition (and aligns with Does chain-of-thought reasoning reflect genuine thinking or performance? from the AR side). ICE uses a confidence-aware early-exit mechanism to cut compute: once answer-token confidence has converged, the answer tokens are decoded in parallel, even while reasoning is still being refined, as in the sketch below.
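Extending the loop above, a convergence test of this flavor might gate the early exit; the threshold `tau` and the stability window are assumptions for illustration, not the paper's reported hyperparameters.

```python
def answer_converged(conf_history, answer_idx, tau=0.9, window=3):
    # conf_history: per-step lists of per-position confidence tensors (the
    # `conf` computed inside ice_refine); answer_idx: positions of the answer
    # section. Exit when every answer position has stayed above tau for
    # `window` consecutive steps, i.e. answer confidence has converged even
    # though the thinking section may still be changing.
    if len(conf_history) < window:
        return False
    return all(bool((c[answer_idx] >= tau).all()) for c in conf_history[-window:])

# On convergence, all answer positions are decoded in one parallel step
# (argmax of the current predictions), and the remaining refinement
# iterations, which would only polish the visible reasoning, are skipped.
```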
The structural implication is that in dLLMs, reasoning and answering are decouplable axes of generation rather than a temporally ordered sequence. The reasoning trace can serve roles other than producing the answer (post-hoc justification or interpretability, for example), and the answer can be produced from internal state earlier than the visible reasoning suggests. This breaks the AR-era assumption that visible CoT length tracks the compute spent before the answer is committed.
Source: Diffusion LLM
Related concepts in this collection
- Can diffusion models commit to answers before full decoding?
  Do diffusion language models settle on correct answers early in their refinement process, and if so, can we detect and exploit this convergence to speed up inference without losing quality?
  complements: same early-convergence property; Prophet exploits it for stopping, ICE exploits it for prompt-structure decoupling

- Can diffusion models enable control that autoregressive models cannot reach?
  Autoregressive language models struggle with complex global controls like syntax and infilling because they generate left-to-right and have discrete token bottlenecks. Can diffusion models' continuous latents and parallel denoising overcome these structural limitations?
  extends: bidirectional attention enables both control (Diffusion-LM) and prompting (ICE) capabilities AR cannot match

- Does chain-of-thought reasoning reflect genuine thinking or performance?
  When language models generate step-by-step reasoning, are they actually thinking through problems or just producing text that looks like reasoning? This matters for understanding whether extended reasoning tokens add real computational value.
  exemplifies: AR analogue; early commitment plus post-hoc reasoning is structurally similar across paradigms

- Do reasoning traces actually cause correct answers?
  Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
  extends: in-place dLLM reasoning makes the post-hoc-justification reading explicit, since the answer is produced before the trace stabilizes

- Can dialogue planning balance fast responses with strategic depth?
  Can a system use quick instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This explores whether adaptive computation improves dialogue goal-reaching.
  complements: ICE's intuitive-then-refine structure is dual-process at the decoding level rather than at the dialogue-planning level

- Does AI actually commodify expertise or tokenize it?
  The standard framing treats AI output like mass-produced commodities, but does AI's contextual, mutable nature fit better with token economics than commodity theory?
  tension: in-place prompting fragments the strict AR token-by-token story; generation is not strictly sequential when prompts and answers refine together
Original note title
in-place prompting in diffusion LLMs eliminates the prefix-only constraint of autoregressive prompting — reasoning embeds within masked positions during refinement