Reinforcement Learning for LLMs · LLM Reasoning and Architecture Design · LLM Interaction

When should an agent actually stop and deliberate?

How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.

Note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback
How should we allocate compute budget at inference time?

SAND (Self-taught Action Deliberation) addresses a question that recurs across the reasoning and agentic literatures: when should a model invest extra computation? In large or unbounded action spaces, deliberating over all possible actions at every step is intractable. But never deliberating misses opportunities to catch errors at critical decision points.

The solution is elegant: at each step, sample N actions from the current policy alongside the expert action. Define an inconsistency indicator over these N+1 candidates: if all of them are identical (the policy distribution is sharply peaked), set the deliberation flag to 0 — the decision is trivial or the model is confident. If any two differ, set the flag to 1 — the model is uncertain, and deliberation should occur.
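A minimal sketch of the inconsistency indicator, assuming actions are comparable strings (the actual action representation and sampling interface depend on the agent framework):

```python
def deliberation_flag(sampled_actions, expert_action):
    """SAND-style inconsistency indicator.

    Returns 0 when all N sampled actions and the expert action agree
    (the policy is confident), 1 when any of them disagree (deliberate)."""
    candidates = set(sampled_actions) | {expert_action}
    return 0 if len(candidates) == 1 else 1
```

Comparing actions by exact equality is an assumption here; in practice, actions may need canonicalization (e.g. normalizing arguments) before the set comparison is meaningful.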

When deliberation triggers, SAND generates execution-guided critiques: instead of judging actions abstractly, it runs forward rollouts from each candidate action and uses the actual outcomes to inform evaluation. This is grounded assessment — not "which action looks better?" but "which action leads to better results?" The critiques are then synthesized into a deliberation thought that augments the trajectory.
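The execution-guided part can be sketched as follows; `rollout` and `score` are hypothetical stand-ins for the environment's forward simulator and outcome metric, not part of the paper's stated API:

```python
def rank_by_rollout(state, candidate_actions, rollout, score):
    """Roll each candidate action forward and rank candidates by the
    realized outcome, so the critique is grounded in results rather
    than in surface plausibility."""
    results = []
    for action in candidate_actions:
        outcome = rollout(state, action)      # forward simulation from this action
        results.append((action, score(outcome)))
    # Best-scoring action first; these (action, score) pairs feed the critique.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```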

The mechanism is self-teaching: deliberation trajectories are used for iterative finetuning of the agent itself. The model learns not just what to do but when to deliberate, internalizing the meta-decision of compute allocation.
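Putting the pieces together, one round of data collection for the self-teaching loop might look like the sketch below; every interface name (`sample_actions`, `expert_action`, `deliberate`) is a hypothetical stand-in:

```python
def collect_deliberation_data(states, sample_actions, expert_action,
                              deliberate, n=4):
    """Build SAND-style finetuning examples: each step records whether
    deliberation fired and, if so, the synthesized deliberation thought."""
    examples = []
    for state in states:
        candidates = sample_actions(state, n) + [expert_action(state)]
        if len(set(candidates)) > 1:    # inconsistent -> deliberate before acting
            thought = deliberate(state, candidates)
        else:                           # consistent -> act directly, no thought
            thought = None
        examples.append((state, thought, candidates[-1]))
    return examples
```

Finetuning on these (state, thought, action) triples is what lets the model internalize both the expert action and the meta-decision of when deliberation is worth the compute.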

This connects to the adaptive compute literature at a different granularity. Can we allocate inference compute based on prompt difficulty? operates at the prompt level (how much total compute for this problem?). Can models learn when to think versus respond quickly? operates at the response level (think or not?). SAND operates at the step level within a trajectory (deliberate at this step or not?). Each solves the same fundamental problem — allocating variable compute based on difficulty — at a different scale.

The contrast with Do reasoning models switch between ideas too frequently? is instructive: underthinking wastes compute by switching topics too early, while universal deliberation wastes compute by thinking too hard at trivial steps. Both are compute-allocation failures, but in opposite directions.


Source: Self Refinement Self Consistency Feedback — SAND (arXiv:2507.07441)


action deliberation should trigger only at uncertain steps — self-consistency sampling identifies when deliberation adds value versus wastes compute