Thinking Inside the Mask: In-Place Prompting in Diffusion LLMs
Although large language models (LLMs) have achieved remarkable success, their prefix-only prompting paradigm and sequential generation process offer limited flexibility for bidirectional information flow. Diffusion large language models (dLLMs) present new opportunities through their bidirectional attention mechanisms and iterative refinement processes, enabling more flexible in-place prompting strategies. We introduce ICE (In-Place Chain-of-Thought Prompting with Early Exit), a novel framework that transforms prefix-only prompting into in-place prompting designed specifically for dLLMs. ICE integrates in-place prompts directly within masked token positions during iterative refinement and employs a confidence-aware early-exit mechanism to significantly reduce computational overhead.
Unlike autoregressive (AR) models, which treat reasoning as sequential prefix conditioning, dLLMs can embed reasoning directly within the generation process itself via in-place prompting (Figure 1). Moreover, AR models exhibit sequential answer emergence: answers remain inaccessible until sequential generation completes. In contrast, dLLMs enable concurrent answer accessibility through bidirectional context modeling, making intermediate answer content visible throughout the iterative refinement process. This architectural distinction creates opportunities for novel confidence-aware optimization strategies that monitor answer content during generation.
In-Place Chain-of-Thought Prompting: This approach integrates reasoning steps directly into masked token positions during iterative refinement. It exploits the bidirectional nature of dLLMs by structuring the generation sequence into distinct thinking and answer sections, with explicit step-by-step reasoning templates embedded within the thinking section. This enables enhanced reasoning performance while preserving parallel generation advantages.
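To make this layout concrete, the following is a minimal Python sketch of how such an in-place prompt could be assembled; the mask token, section markers, and step template here are illustrative assumptions, not the exact format used in the paper.

```python
# Minimal sketch of in-place chain-of-thought prompt construction.
# MASK, the section markers, and the step template are illustrative
# assumptions, not the paper's exact format.
MASK = "[MASK]"

def build_in_place_prompt(question: str, n_steps: int = 3,
                          step_len: int = 16, answer_len: int = 8) -> list[str]:
    """Lay out a generation sequence whose thinking section embeds an
    explicit step-by-step template among masked positions, followed by a
    fully masked answer section that the dLLM refines concurrently."""
    seq = [question, "<think>"]
    for i in range(1, n_steps + 1):
        seq.append(f"Step {i}:")         # in-place reasoning scaffold
        seq.extend([MASK] * step_len)    # masked slots refined iteratively
    seq += ["</think>", "Answer:"]
    seq.extend([MASK] * answer_len)      # answer region, visible from step 0
    return seq
```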
Two-Phase Decoding with Early Exit Mechanism: Motivated by a crucial empirical observation, we design a confidence-aware inference strategy that capitalizes on the distinct refinement patterns of the reasoning and answer components. Through systematic analysis of iterative refinement dynamics, we uncover a distinctive behavioral pattern in dLLMs: model confidence in answer tokens converges rapidly to a high level and remains stable, while the reasoning section continues undergoing refinement (Figure 2). This observation reveals that models often internally determine correct answers significantly earlier than the completion of their explicit reasoning traces. Leveraging this insight, ICE implements a two-phase decoding approach that enables parallel decoding of all answer tokens while effectively reducing redundant computation.
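As a rough illustration of this two-phase scheme, the sketch below tracks mean confidence over the answer region and, once it crosses a threshold, commits all answer tokens in a single parallel step. The threshold `tau`, the mask-token id, and the one-token-per-step unmasking rule are simplifying assumptions, not the paper's exact procedure.

```python
import torch

MASK_ID = 0  # placeholder mask-token id; model-specific in practice

def two_phase_decode(model, seq_ids: torch.Tensor, answer_slice: slice,
                     tau: float = 0.9, max_steps: int = 128):
    """Phase 1: iterative refinement while tracking mean confidence over
    the answer region. Phase 2: once that confidence reaches tau, fill
    all answer positions in parallel and exit early."""
    for step in range(max_steps):
        logits = model(seq_ids)               # one bidirectional pass
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)        # per-position confidence / argmax

        if conf[answer_slice].mean() >= tau:  # answer confidence has converged
            seq_ids = seq_ids.clone()
            seq_ids[answer_slice] = pred[answer_slice]  # parallel answer decode
            return seq_ids, step

        masked = seq_ids.eq(MASK_ID)          # standard refinement step:
        if not masked.any():                  # commit only the single most
            break                             # confident still-masked position
        i = conf.masked_fill(~masked, -1.0).argmax()
        seq_ids = seq_ids.clone()
        seq_ids[i] = pred[i]
    return seq_ids, max_steps
```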
In-Place Chain-of-Thought Prompting
The iterative, non-autoregressive generation paradigm of dLLMs, coupled with their inherent bidirectional attention mechanisms, enables a fundamental departure from the conventional prefix-only prompting strategies employed in autoregressive models. While autoregressive models are constrained to sequential, left-to-right generation, dLLMs can consider the entire sequence context simultaneously and enable concurrent answer accessibility. This architectural advantage unlocks novel prompting paradigms, as it allows the model to refine reasoning steps while maintaining awareness of answer regions throughout the generation process.
Our approach leverages this distinctive capability by structuring the generation sequence $y_{\text{gen}}$ into two semantically distinct sections: a thinking section $y_{\text{thinking}}$ and an answer section $y_{\text{answer}}$. This structural division is uniquely enabled by dLLMs' bidirectional nature: unlike autoregressive models, where reasoning must be generated in full before any answer content becomes available, dLLMs can simultaneously consider both reasoning and answer contexts during iterative refinement. Formally, we define the generation sequence as
$$y_{\text{gen}} = [y_{\text{thinking}}; y_{\text{answer}}],$$
where $[\cdot\,;\cdot]$ denotes sequence concatenation.
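Concretely, assuming fixed section lengths, the two sections can be addressed as index ranges over the token sequence; the lengths below are arbitrary placeholders, and `answer_slice` is the region a confidence monitor such as the `two_phase_decode` sketch above would watch.

```python
# Hypothetical index layout for y_gen; the lengths are placeholders,
# not values taken from the paper.
prompt_len, think_len, answer_len = 32, 128, 8

thinking_slice = slice(prompt_len, prompt_len + think_len)
answer_slice = slice(prompt_len + think_len,
                     prompt_len + think_len + answer_len)
# y_thinking = y[thinking_slice]; y_answer = y[answer_slice]
```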