Reinforcement Learning for LLMs · LLM Reasoning and Architecture Design · LLM Interaction

When should an agent actually stop and deliberate?

How can models detect when deliberation over action choices is genuinely needed versus wasteful? This matters because unbounded action spaces make universal deliberation intractable, yet skipping it entirely risks missing critical errors.

Note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback
How should we allocate compute budget at inference time?

SAND (Self-taught Action Deliberation) addresses a question that recurs across the reasoning and agentic literatures: when should a model invest extra computation? In large or unbounded action spaces, deliberating over all possible actions at every step is intractable. But never deliberating misses opportunities to catch errors at critical decision points.

The solution is elegant: at each step, sample N actions from the current policy alongside the expert action. Define an inconsistency indicator over these N+1 candidates: if all of them are identical (the policy distribution is sharply peaked), set the deliberation flag to 0 — the decision is trivial or the model is confident. If any two differ, set the flag to 1 — the model is uncertain, and deliberation should occur.
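A minimal sketch of the inconsistency indicator, assuming actions are comparable strings (the actual action representation and sampling interface depend on the agent framework):

```python
def deliberation_flag(sampled_actions, expert_action):
    """SAND-style inconsistency indicator.

    Returns 0 when all N sampled actions and the expert action agree
    (the policy is confident), 1 when any of them disagree (deliberate)."""
    candidates = set(sampled_actions) | {expert_action}
    return 0 if len(candidates) == 1 else 1
```

Comparing actions by exact equality is an assumption here; in practice, actions may need canonicalization (e.g. normalizing arguments) before the set comparison is meaningful.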

When deliberation triggers, SAND generates execution-guided critiques: instead of judging actions abstractly, it runs forward rollouts from each candidate action and uses the actual outcomes to inform evaluation. This is grounded assessment — not "which action looks better?" but "which action leads to better results?" The critiques are then synthesized into a deliberation thought that augments the trajectory.
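The execution-guided part can be sketched as follows; `rollout` and `score` are hypothetical stand-ins for the environment's forward simulator and outcome metric, not part of the paper's stated API:

```python
def rank_by_rollout(state, candidate_actions, rollout, score):
    """Roll each candidate action forward and rank candidates by the
    realized outcome, so the critique is grounded in results rather
    than in surface plausibility."""
    results = []
    for action in candidate_actions:
        outcome = rollout(state, action)      # forward simulation from this action
        results.append((action, score(outcome)))
    # Best-scoring action first; these (action, score) pairs feed the critique.
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```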

The mechanism is self-teaching: deliberation trajectories are used for iterative finetuning of the agent itself. The model learns not just what to do but when to deliberate, internalizing the meta-decision of compute allocation.
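Putting the pieces together, one round of data collection for the self-teaching loop might look like the sketch below; every interface name (`sample_actions`, `expert_action`, `deliberate`) is a hypothetical stand-in:

```python
def collect_deliberation_data(states, sample_actions, expert_action,
                              deliberate, n=4):
    """Build SAND-style finetuning examples: each step records whether
    deliberation fired and, if so, the synthesized deliberation thought."""
    examples = []
    for state in states:
        candidates = sample_actions(state, n) + [expert_action(state)]
        if len(set(candidates)) > 1:    # inconsistent -> deliberate before acting
            thought = deliberate(state, candidates)
        else:                           # consistent -> act directly, no thought
            thought = None
        examples.append((state, thought, candidates[-1]))
    return examples
```

Finetuning on these (state, thought, action) triples is what lets the model internalize both the expert action and the meta-decision of when deliberation is worth the compute.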

This connects to the adaptive compute literature at a different granularity. Can we allocate inference compute based on prompt difficulty? operates at the prompt level (how much total compute for this problem?). Can models learn when to think versus respond quickly? operates at the response level (think or not?). SAND operates at the step level within a trajectory (deliberate at this step or not?). Each solves the same fundamental problem — allocating variable compute based on difficulty — at a different scale.

The contrast with Do reasoning models switch between ideas too frequently? is instructive: underthinking wastes compute by switching topics too early, while universal deliberation wastes compute by thinking too hard at trivial steps. Both are compute-allocation failures, but in opposite directions.


Source: Self Refinement Self Consistency Feedback — SAND (arXiv:2507.07441)


action deliberation should trigger only at uncertain steps — self-consistency sampling identifies when deliberation adds value versus wastes compute