Tags: Reinforcement Learning for LLMs · LLM Reasoning and Architecture · Agentic and Multi-Agent Systems

Can dialogue planning balance fast responses with strategic depth?

Can a system use quick, instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This note explores whether adaptive computation improves goal achievement in dialogue.

Note · 2026-02-22 · sourced from Conversation Architecture Structure
Related questions: How should we allocate compute budget at inference time? · Why do AI agents fail to take initiative? · How should researchers navigate LLM reasoning research?

Proactive dialogue requires planning — steering conversations toward predetermined goals. LLMs typically struggle with this because of their reactive nature. The Dual-Process Dialogue Planning (DPDP) framework addresses this by implementing Kahneman's System 1/System 2 distinction:

System 1 — A neural policy language model that handles familiar dialogue contexts with quick, instinctive responses. Trained through offline RL to build a robust initial policy that mitigates suboptimal strategies from noisy training data.

System 2 — An MCTS-based planner that provides analytical, rational (but slower) planning for complex or novel scenarios where the policy model is uncertain.
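The note does not spell out the search internals, but a PUCT-style selection rule (as in AlphaZero-family planners) is a common way to let the policy model's prior steer MCTS toward promising dialogue actions. The sketch below assumes that variant; the `Node` fields and the `c_puct` constant are illustrative, not DPDP's actual implementation.

```python
import math

class Node:
    """One search-tree node; fields are an illustrative assumption."""
    def __init__(self, prior: float):
        self.prior = prior       # policy model's probability for the action leading here
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}       # action -> Node

    def value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node: Node, c_puct: float = 1.5):
    """Pick the (action, child) pair maximizing value plus a prior-weighted exploration bonus."""
    total_visits = sum(child.visits for child in node.children.values())
    def puct_score(child: Node) -> float:
        exploration = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visits)
        return child.value() + exploration
    return max(node.children.items(), key=lambda kv: puct_score(kv[1]))
```

The key design choice is that the exploration bonus is weighted by the policy prior, so even the slow System 2 search is biased by what System 1 already believes.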

Dynamic switching between systems is driven by the policy model's own uncertainty estimate. When the model is confident about the next dialogue action, System 1 fires. When uncertainty is high — novel context, complex goal structure, ambiguous user behavior — System 2 activates for deeper search.
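A minimal sketch of the gate, assuming predictive entropy over the action distribution as the uncertainty estimate and a fixed threshold. `policy_model`, `mcts_plan`, and `ENTROPY_THRESHOLD` are hypothetical names; DPDP's actual uncertainty measure and calibration may differ.

```python
import torch

ENTROPY_THRESHOLD = 1.0  # assumed hyperparameter; in practice tuned on held-out dialogues

def select_dialogue_action(policy_model, mcts_plan, state):
    """Route a dialogue state to System 1 or System 2 based on policy entropy."""
    logits = policy_model(state)                          # scores over candidate dialogue actions
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum()

    if entropy < ENTROPY_THRESHOLD:
        # System 1: the policy is confident, so act on its top choice directly.
        return int(probs.argmax())
    # System 2: high uncertainty, so run the slower MCTS search seeded by the policy prior.
    return mcts_plan(state, prior=probs)
```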

The two-stage training is the key innovation. Stage 1 uses offline RL to refine the policy model's base capabilities. Stage 2 uses MCTS simulations to guide the policy model toward generating superior strategies, accelerating convergence. The policy model progressively internalizes the MCTS planner's strategic depth, so over time System 1 handles more situations directly.
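One common way to realize "MCTS guides the policy" is an AlphaZero-style update in which normalized visit counts from the search become the training target. The sketch below assumes that formulation (the paper's exact Stage 2 objective may differ); `visit_counts` is taken to come from running the System 2 search on the same state.

```python
import torch
import torch.nn.functional as F

def stage2_update(policy_model, optimizer, state, visit_counts):
    """One MCTS-guided improvement step: distill search statistics into the policy."""
    target = visit_counts / visit_counts.sum()          # normalized visit counts as improved targets
    log_probs = F.log_softmax(policy_model(state), dim=-1)
    loss = -(target * log_probs).sum()                  # cross-entropy to the search distribution
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Repeating this step is what lets System 1 progressively absorb System 2's strategic depth: actions the search favors become high-probability (low-entropy) policy outputs, so future visits to similar states no longer trigger the expensive planner.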

This connects directly to existing test-time compute findings. As in "Can models learn when to think versus respond quickly?", DPDP applies the same principle to dialogue planning: spend more compute (MCTS) only when the policy model's uncertainty warrants it. The result is efficiency that matches or exceeds pure MCTS-based methods while maintaining strategic depth.

The architecture embodies the broader principle explored in "Does RL teach reasoning or teach when to use it?": the policy model's dialogue capabilities already exist from pretraining, and the uncertainty-switching mechanism teaches *when* to deploy deep planning rather than how to plan. Additionally, by restricting System 2 (MCTS) to uncertain contexts, DPDP naturally avoids the overthinking threshold documented in "Does more thinking time always improve reasoning accuracy?": deep search activates only when warranted, avoiding the indiscriminate extended reasoning that degrades performance.


Source: Conversation Architecture Structure

Original note title: dual-process dialogue planning applies System 1 and System 2 cognition to conversation — instinctive policy for familiar contexts and MCTS for novel scenarios, switching on uncertainty