Tags: Reinforcement Learning for LLMs · LLM Reasoning and Architecture · Agentic and Multi-Agent Systems

Can dialogue planning balance fast responses with strategic depth?

Can a system use quick, instinctive responses for familiar conversation contexts while activating deeper planning only when uncertainty demands it? This note explores whether adaptive computation improves goal achievement in dialogue.

Note · 2026-02-22 · sourced from Conversation Architecture Structure
Related questions: How should we allocate compute budget at inference time? · Why do AI agents fail to take initiative? · How should researchers navigate LLM reasoning research?

Proactive dialogue requires planning — steering conversations toward predetermined goals. LLMs typically struggle with this because of their reactive nature. The Dual-Process Dialogue Planning (DPDP) framework addresses this by implementing Kahneman's System 1/System 2 distinction:

System 1 — A neural policy language model that handles familiar dialogue contexts with quick, instinctive responses. Trained through offline RL to build a robust initial policy that mitigates suboptimal strategies from noisy training data.

System 2 — An MCTS-based planner that provides analytical, rational (but slower) planning for complex or novel scenarios where the policy model is uncertain.
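The note does not spell out the search internals, but a PUCT-style selection rule (as in AlphaZero-family planners) is a common way to let the policy model's prior steer MCTS toward promising dialogue actions. The sketch below assumes that variant; the `Node` fields and the `c_puct` constant are illustrative, not DPDP's actual implementation.

```python
import math

class Node:
    """One search-tree node; fields are an illustrative assumption."""
    def __init__(self, prior: float):
        self.prior = prior       # policy model's probability for the action leading here
        self.visits = 0
        self.value_sum = 0.0
        self.children = {}       # action -> Node

    def value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node: Node, c_puct: float = 1.5):
    """Pick the (action, child) pair maximizing value plus a prior-weighted exploration bonus."""
    total_visits = sum(child.visits for child in node.children.values())
    def puct_score(child: Node) -> float:
        exploration = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visits)
        return child.value() + exploration
    return max(node.children.items(), key=lambda kv: puct_score(kv[1]))
```

The key design choice is that the exploration bonus is weighted by the policy prior, so even the slow System 2 search is biased by what System 1 already believes.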

Dynamic switching between systems is driven by the policy model's own uncertainty estimate. When the model is confident about the next dialogue action, System 1 fires. When uncertainty is high — novel context, complex goal structure, ambiguous user behavior — System 2 activates for deeper search.
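A minimal sketch of the gate, assuming predictive entropy over the action distribution as the uncertainty estimate and a fixed threshold. `policy_model`, `mcts_plan`, and `ENTROPY_THRESHOLD` are hypothetical names; DPDP's actual uncertainty measure and calibration may differ.

```python
import torch

ENTROPY_THRESHOLD = 1.0  # assumed hyperparameter; in practice tuned on held-out dialogues

def select_dialogue_action(policy_model, mcts_plan, state):
    """Route a dialogue state to System 1 or System 2 based on policy entropy."""
    logits = policy_model(state)                          # scores over candidate dialogue actions
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-9))).sum()

    if entropy < ENTROPY_THRESHOLD:
        # System 1: the policy is confident, so act on its top choice directly.
        return int(probs.argmax())
    # System 2: high uncertainty, so run the slower MCTS search seeded by the policy prior.
    return mcts_plan(state, prior=probs)
```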

The two-stage training is the key innovation. Stage 1 uses offline RL to refine the policy model's base capabilities. Stage 2 uses MCTS simulations to guide the policy model toward generating superior strategies, accelerating convergence. The policy model progressively internalizes the MCTS planner's strategic depth, so over time System 1 handles more situations directly.
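One common way to realize "MCTS guides the policy" is an AlphaZero-style update in which normalized visit counts from the search become the training target. The sketch below assumes that formulation (the paper's exact Stage 2 objective may differ); `visit_counts` is taken to come from running the System 2 search on the same state.

```python
import torch
import torch.nn.functional as F

def stage2_update(policy_model, optimizer, state, visit_counts):
    """One MCTS-guided improvement step: distill search statistics into the policy."""
    target = visit_counts / visit_counts.sum()          # normalized visit counts as improved targets
    log_probs = F.log_softmax(policy_model(state), dim=-1)
    loss = -(target * log_probs).sum()                  # cross-entropy to the search distribution
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Repeating this step is what lets System 1 progressively absorb System 2's strategic depth: actions the search favors become high-probability (low-entropy) policy outputs, so future visits to similar states no longer trigger the expensive planner.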

This connects directly to existing test-time compute findings. As in "Can models learn when to think versus respond quickly?", DPDP applies the same principle to dialogue planning: spend more compute (MCTS) only when the policy model's uncertainty warrants it. The result is efficiency that matches or exceeds pure MCTS-based methods while maintaining strategic depth.

The architecture embodies the broader principle explored in "Does RL teach reasoning or teach when to use it?": the policy model's dialogue capabilities already exist from pretraining, and the uncertainty-switching mechanism teaches *when* to deploy deep planning rather than how to plan. Additionally, by restricting System 2 (MCTS) to uncertain contexts, DPDP naturally avoids the overthinking threshold documented in "Does more thinking time always improve reasoning accuracy?": deep search activates only when warranted, avoiding the indiscriminate extended reasoning that degrades performance.


Source: Conversation Architecture Structure

Original note title: dual-process dialogue planning applies System 1 and System 2 cognition to conversation — instinctive policy for familiar contexts and MCTS for novel scenarios, switching on uncertainty