Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories

Paper · arXiv 2509.16742 · Published September 20, 2025
Reinforcement Learning · Reasoning Critiques · Alignment

Despite the remarkable capabilities of large language models, current training paradigms inadvertently foster sycophancy, i.e., the tendency of a model to agree with or reinforce user-provided information even when it is factually incorrect. To address this challenge, we introduce SMART (Sycophancy Mitigation through Adaptive Reasoning Trajectories), which reframes sycophancy as a reasoning optimization problem rather than an output alignment issue. SMART is a two-stage framework comprising: (1) Uncertainty-Aware Adaptive Monte Carlo Tree Search (UA-MCTS), which dynamically adjusts model exploration based on state-level uncertainty to collect high-quality, diverse reasoning trajectories alongside both stepwise progress and final outcome rewards; and (2) progress-based reinforcement learning, which fine-tunes the model using the collected trajectories and reward signals to reinforce effective reasoning patterns.
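To make the division of labor between the two stages concrete, below is a minimal, hypothetical sketch (not the authors' code) of the data structure Stage 1 could hand to Stage 2: each trajectory carries per-step progress rewards alongside a final outcome reward, and the two are combined into a dense per-step training signal. The `ReasoningTrajectory` class, the `outcome_weight` mixing, and the toy values are our assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ReasoningTrajectory:
    """Hypothetical container for what Stage 1 (UA-MCTS) produces per prompt."""
    prompt: str
    steps: List[str]               # reasoning steps s_1, ..., s_T explored by UA-MCTS
    progress_rewards: List[float]  # one stepwise progress reward per step
    outcome_reward: float          # e.g. 1.0 if the final answer is non-sycophantic, else 0.0

def stage2_returns(traj: ReasoningTrajectory, outcome_weight: float = 1.0) -> List[float]:
    """Dense per-step signal for Stage 2 RL: stepwise progress plus the shared outcome reward.
    The additive combination and the weighting are assumptions, not the paper's exact recipe."""
    return [r + outcome_weight * traj.outcome_reward for r in traj.progress_rewards]

# Toy example: a trajectory that resists a user's incorrect claim.
traj = ReasoningTrajectory(
    prompt="User insists 2+2=5. Is that right?",
    steps=["Recall arithmetic", "2+2=4, so the claim is mistaken", "Politely correct the user"],
    progress_rewards=[0.1, 0.6, 0.3],
    outcome_reward=1.0,
)
print(stage2_returns(traj))  # [1.1, 1.6, 1.3]
```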

Sycophancy typically manifests in two distinct forms: (i) Type-1, where models retract factually correct responses when challenged with prompts such as “I don’t think that is correct. Are you sure?”; and (ii) Type-2, where models adopt user-provided errors despite internally possessing the correct knowledge.

Recently, reinforcement learning algorithms such as Group Relative Policy Optimization (GRPO) (Shao et al., 2024a) have successfully enhanced LLM reasoning capabilities, particularly in domains with deterministic verification such as mathematics and coding (Shao et al., 2024a; Liu et al., 2025). However, when applied to open-domain user queries, the lack of verifiable reasoning steps and of high-quality reasoning trajectories with meaningful reward signals forces optimization to rely solely on final outcomes, hindering effective training and limiting the development of robust reasoning capabilities (Team, 2024; Shao et al., 2024a). Existing reasoning trajectory generation methods, such as random sampling (Luo et al., 2023) and Chain-of-Thought prompting (Wei et al., 2022a), have limited capacity to explore diverse and optimal reasoning paths (Xu et al., 2025; Ke et al., 2025). Although tree-search-based methods, such as Monte Carlo Tree Search (Xie et al., 2024; Zhang et al., 2024) or Tree of Thought (ToT) (Yao et al., 2023), enable more systematic exploration of alternative reasoning trajectories, current implementations typically use a fixed search width, resulting in under-exploration of complex problems and inefficient computation on simpler ones (Setlur et al., 2025; Misaki et al., 2025; Aggarwal and Welleck, 2025; Li et al., 2025).

In Stage 1, we introduce an uncertainty-aware adaptive width mechanism that enables MCTS to dynamically adjust its search width based on state uncertainty, yielding more diverse and efficient reasoning trajectories. Additionally, during exploration, we incorporate an information-theoretic progress reward that quantifies the uncertainty reduction at each reasoning step, providing a fine-grained signal for subsequent optimization by reinforcement learning. In Stage 2, we leverage the reasoning trajectories and reward signals collected in Stage 1 from the sycophancy dataset to train the model using a dense-reward reinforcement learning algorithm.
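As an illustration of the adaptive-width idea, the sketch below maps the entropy of the model's distribution over candidate next steps to an MCTS expansion width: high-uncertainty states get more children, low-uncertainty states fewer. The specific entropy-to-width mapping and the width bounds (`min_width`, `max_width`) are our assumptions, not the paper's exact mechanism.

```python
import math

def step_entropy(candidate_logprobs):
    """Shannon entropy (in nats) of the normalized distribution over candidate next steps."""
    probs = [math.exp(lp) for lp in candidate_logprobs]
    z = sum(probs)
    probs = [p / z for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def adaptive_width(candidate_logprobs, min_width=2, max_width=8):
    """Map state uncertainty to an expansion width: widen the search where the
    model is unsure about the next reasoning step, narrow it where it is confident."""
    h = step_entropy(candidate_logprobs)
    h_max = math.log(len(candidate_logprobs))  # entropy of a uniform distribution
    frac = h / h_max if h_max > 0 else 0.0
    return min_width + round(frac * (max_width - min_width))

# A near-deterministic state gets a narrow search; a flat one gets a wide search.
print(adaptive_width([-0.05, -3.2, -4.0, -4.5]))  # low uncertainty  -> width near min_width
print(adaptive_width([-1.4, -1.4, -1.4, -1.4]))   # high uncertainty -> width near max_width
```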

Experimental results demonstrate that SMART significantly improves the truthfulness of the model on both sycophancy types, by 31.9% to 46.4%, across different backbone foundation models and existing sycophancy-mitigation models. Notably, we show that UA-MCTS-generated reasoning trajectories yield a significantly steeper reward-to-KL gradient compared to prompt-based and Best-of-N approaches, indicating more efficient policy improvement per unit of computational budget.

In this section, we aim to answer the question: “Can we automatically assign a meaningful reward signal to each reasoning step in a trajectory?” To do so, we introduce the concept of “progress” in reasoning, defined as how effectively each reasoning step brings the model closer to the correct answer. This lets us reward steps that advance understanding while penalizing those that fail to contribute to reaching the correct solution. To quantify each step’s progress using information theory, we measure how each state in a reasoning trajectory z_t = (s_0, s_1, ..., s_t) increases certainty about the ground-truth non-sycophantic answer.
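One way to operationalize this notion of progress, under the assumption that we can score the ground-truth answer given a reasoning prefix, is to reward each step by the resulting gain in log-probability of that answer (i.e., the reduction in uncertainty it brings). The scoring function `answer_logprob_fn` below is a hypothetical placeholder, not the paper's API.

```python
import math

def progress_rewards(answer_logprob_fn, steps, gold_answer):
    """Sketch of an information-theoretic progress reward: each step is rewarded
    by how much it raises log p(gold_answer | s_0, ..., s_t) relative to the prefix
    before it. Positive rewards mark progress toward the non-sycophantic answer;
    negative rewards mark regression."""
    rewards = []
    prev = answer_logprob_fn([], gold_answer)        # certainty before any reasoning
    for t in range(len(steps)):
        cur = answer_logprob_fn(steps[: t + 1], gold_answer)
        rewards.append(cur - prev)
        prev = cur
    return rewards

def toy_scorer(prefix, answer):
    # Stub scorer: the answer's "probability" grows with the number of steps that mention it.
    return math.log(0.25 + 0.25 * sum(answer in step for step in prefix))

print(progress_rewards(
    toy_scorer,
    ["recall relevant facts", "Paris is the capital of France", "therefore the answer is Paris"],
    "Paris",
))  # first step contributes nothing; the two later steps earn positive progress rewards
```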