Reasoning Models Are More Easily Gaslighted Than You Think
In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI’s o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, MathVista, and CharXiv. Our evaluation reveals significant accuracy drops (25–29% on average) following gaslighting negation prompts, indicating that even top-tier reasoning models struggle to preserve correct answers under manipulative user feedback. Built upon the insights of the evaluation and to further probe this vulnerability, we introduce GaslightingBench-R, a new diagnostic benchmark specifically designed to evaluate reasoning models’ susceptibility to defend their belief under gaslighting negation prompt.
explicit reasoning mechanisms should enhance model robustness. By exposing intermediate steps, such as chain-of-thought traces, reasoning models are expected to self-inspect and, in principle, self-correct their reasoning before committing to a final answer [20]. In theory, such transparency should offer a safeguard against prompt-based manipulation. Nevertheless, recent studies suggest that both LLMs and MLLMs remain surprisingly vulnerable to misleading user input. For instance, LLMs often exhibit sycophantic tendencies, ie., agreeing with user assertions even at the cost of factual accuracy due to biases introduced by human preference data