SAND: Boosting LLM Agents with Self-Taught Action Deliberation
Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. Most of these methods focus on imitating specific expert behaviors or promoting chosen reasoning thoughts and actions over rejected ones. However, without reasoning over and comparing alternative actions, LLM agents finetuned with these methods may over-commit to seemingly plausible but suboptimal actions due to limited action space exploration. To address this, in this paper we propose the Self-taught ActioN Deliberation (SAND) framework, which enables LLM agents to explicitly deliberate over candidate actions before committing to one. To tackle the challenges of deciding when and what to deliberate over in a large action space and of producing step-level action evaluations, we incorporate self-consistency action sampling and execution-guided action critique to synthesize step-wise action deliberation thoughts using the base model of the LLM agent. The deliberation trajectories are then used to iteratively finetune the LLM agent itself.
To address this, in this paper we aim to teach the LLM agent to deliberate by first generating several candidate actions for the current state, evaluating and comparing their likely outcomes, and committing only after this evaluation. We propose the Self-taught ActioN Deliberation (SAND) framework to instantiate this idea by teaching the LLM agent with deliberation thoughts synthesized by the base version of itself. However, as the action space of LLM agent tasks is often large or even unbounded (Yao et al., 2022; Lin et al., 2025), it is intractable to deliberate over all actions and inefficient to deliberate at every single step. To tackle the challenge of when and what to deliberate, we devise self-consistency action sampling along expert trajectories to sample uncertain candidate actions of the LLM agent at non-trivial decision-making steps. To provide more informative and grounded step-level evaluations for each sampled candidate action, we use executed rollouts of each action to guide critique generation. The action critiques are then used to synthesize an action deliberation thought with the base LLM, which augments the initial expert trajectory and yields deliberation trajectories for iterative finetuning of the LLM agent. Experiments on two interactive tasks demonstrate the advantage of our method over strong agent tuning baselines.
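To make the data-synthesis loop concrete, the following is a minimal Python sketch of one SAND iteration. The helper names (`sample_action`, `rollout`, `critique`, `synthesize_deliberation`) and the trajectory layout are hypothetical placeholders for the components described above, not the authors' implementation.

```python
# Hypothetical sketch of one SAND data-synthesis iteration (helper names are illustrative).
# `policy`, `base_llm`, and `env` are assumed objects exposing the methods used below.
def synthesize_deliberation_trajectories(expert_trajectories, policy, base_llm, env, n_samples=5):
    deliberation_data = []
    for traj in expert_trajectories:
        history = []
        for expert_thought, expert_action, observation in traj:
            # Self-consistency action sampling along the expert trajectory (Sec. 4.2).
            sampled = [policy.sample_action(history) for _ in range(n_samples)]
            candidates = set(sampled) | {expert_action}
            if len(candidates) > 1:  # non-trivial step: deliberation is warranted
                # Execution-guided critique: roll out each candidate and critique its outcome.
                critiques = {a: base_llm.critique(history, a, env.rollout(history, a))
                             for a in candidates}
                # Synthesize a deliberation thought comparing the candidates.
                thought = base_llm.synthesize_deliberation(history, candidates, critiques)
                deliberation_data.append((list(history), thought, expert_action))
            history.append((expert_thought, expert_action, observation))
    return deliberation_data  # used to finetune `policy`, after which the loop repeats
```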
4.2 Self-Consistency Action Sampling
With an LLM agent policy πθ, we aim to further teach the agent action deliberation behavior. Two central questions arise: (i) when should the agent invest extra thinking over actions, and (ii) which actions should it think about, especially within a large or even unbounded action space? To address both, we utilize self-consistency action sampling, which offers a natural solution.
For each expert trajectory e, we replay the expert interactions and branch at each step t. Specifically, given the expert interaction history ht−1, the current policy πθ samples N actions
\[
\{\hat{a}_t^{(1)}, \ldots, \hat{a}_t^{(N)}\} \sim \pi_\theta(\cdot \mid h_{t-1}), \tag{3}
\]
where we omit the sampled reasoning thoughts ẑt here for notational simplicity. Together with the original expert action at, we form a candidate action set of size N + 1.
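As a concrete illustration of Eq. (3), the snippet below samples N candidate actions from a toy stand-in for πθ and merges them with the expert action. The toy policy and the WebShop-style action strings are illustrative assumptions only, not part of the method.

```python
import random

def toy_policy(history):
    """Illustrative stand-in for pi_theta(. | h_{t-1}); a real agent would decode
    a reasoning thought and an action from the LLM conditioned on the history."""
    return random.choice(["search[red shoes]", "click[item_1]", "click[back]"])

def sample_candidate_actions(history, expert_action, n=5):
    # Draw N actions from the current policy (Eq. 3) ...
    sampled = [toy_policy(history) for _ in range(n)]
    # ... and add the original expert action a_t to form the candidate set.
    return sampled, set(sampled) | {expert_action}

sampled, candidates = sample_candidate_actions(history=[], expert_action="search[red shoes]")
print(sampled, candidates)
```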
We then define an inconsistency indicator that flags whether deliberation is needed at step t:
\[
\mathbb{1}_{\mathrm{delib}}(t) = \mathbb{1}\!\left[\,\bigl|\{\hat{a}_t^{(1)}, \ldots, \hat{a}_t^{(N)}, a_t\}\bigr| > 1\,\right]. \tag{4}
\]
If all actions in the set are the same, 1delib(t) = 0: the predictive distribution πθ(· | ht−1) is sharply peaked, suggesting that the model is confident in taking the expert action at or that the decision at the current state is trivial. In this case, no extra reasoning or deliberation is needed. If the set contains more than one unique action, 1delib(t) = 1: the LLM agent is uncertain at the current state, and generating an explicit deliberation thought can help it better choose among the candidate actions.
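In code, the indicator in Eq. (4) reduces to checking whether the candidate set contains more than one unique element; a minimal sketch (the action strings are illustrative):

```python
def needs_deliberation(sampled_actions, expert_action):
    """Eq. (4): flag the step when the candidate set {a_hat_1, ..., a_hat_N, a_t}
    contains more than one unique action."""
    return len(set(sampled_actions) | {expert_action}) > 1

# Confident or trivial step: all samples agree with the expert action -> skip deliberation.
assert not needs_deliberation(["click[buy now]"] * 5, "click[buy now]")
# Uncertain step: candidates disagree -> synthesize an explicit deliberation thought.
assert needs_deliberation(["click[buy now]", "click[back]"], "search[red shoes]")
```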
Moreover, since every branch starts from a step on the expert trajectory e, the sampled actions ˆat remain close to both the demonstration distribution and the current LLM policy distribution while still exploring diverse futures, thereby avoiding random exploration over the large action space.