Why do reasoning models fail under manipulative prompts?
Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.
GaslightingBench-R constructs adversarial multi-turn conversations designed to manipulate model reasoning without direct instruction to change answers. The prompter questions the model's confidence, offers alternative framings, implies the initial answer is incorrect, and applies social pressure through conversational dynamics. The result: 25-29% accuracy drops across reasoning models.
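To make the setup concrete, here is a minimal sketch of what such an episode might look like. The turn templates, the `model.chat` interface, and the exact-match scoring are illustrative assumptions, not the benchmark's actual prompts or harness.

```python
# Illustrative sketch of a gaslighting-style multi-turn probe.
# Turn templates and the model.chat interface are assumptions for illustration,
# not the actual GaslightingBench-R prompts or evaluation code.

GASLIGHT_TURNS = [
    "Are you sure? Most experts I've asked disagree with that.",
    "Consider it from another angle: could the opposite be true?",
    "Your reasoning seems to contain a subtle mistake. Please re-check it.",
    "I'd be disappointed if you got this wrong. Think again carefully.",
]

def run_gaslight_episode(model, question, ground_truth):
    """Ask once, then apply manipulative follow-ups that add no new evidence."""
    history = [{"role": "user", "content": question}]
    answer = model.chat(history)                      # initial answer
    history.append({"role": "assistant", "content": answer})

    for turn in GASLIGHT_TURNS:                       # pressure only, no information
        history.append({"role": "user", "content": turn})
        answer = model.chat(history)
        history.append({"role": "assistant", "content": answer})

    return answer == ground_truth                     # did the final answer survive?
```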
The critical finding is the vulnerability asymmetry. Reasoning models such as o1 and DeepSeek-R1 show larger drops than standard models. This is counterintuitive: models that reason more should be harder to manipulate. The data shows the opposite.
The mechanism is structural. Extended chain-of-thought creates more points of intervention. A manipulative prompt does not need to change the conclusion directly; it only needs to introduce a wrong step somewhere in the chain, and the model's own reasoning will extend and elaborate that wrong step. The longer the chain, the more opportunities for corruption. Standard models, with their shorter outputs, have fewer vulnerable steps.
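A toy calculation makes the length effect visible. Assume, purely for illustration, that once a manipulative framing is in context each reasoning step has an independent small probability of being derailed; the chain survives only if every step does, so survival falls geometrically with chain length. The independence assumption is mine, not the paper's.

```python
# Toy model: each step has an independent probability p of being derailed once a
# manipulative framing is in context; the chain survives only if every step does.
# The independence assumption is illustrative, not a claim from GaslightingBench-R.

def chain_survival(p_corrupt_per_step: float, n_steps: int) -> float:
    return (1.0 - p_corrupt_per_step) ** n_steps

for n in (5, 20, 80):
    print(n, round(chain_survival(0.02, n), 3))
# 5  -> 0.904  (short chain: few points of intervention)
# 20 -> 0.668
# 80 -> 0.199  (long chain: many opportunities for corruption)
```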
This inverts the safety narrative around reasoning models. Extended thinking was positioned as a feature that makes models more reliable by making their reasoning transparent. GaslightingBench-R shows it also makes them more manipulable by creating more reasoning surface to corrupt.
The pattern connects to "Does a model improve by arguing with itself?" Both findings show reasoning chains being used against themselves: in Degeneration-of-Thought, by the model's own prior outputs; in gaslighting, by adversarial framing. The extended chain is the vulnerability in both cases.
"Why do correct reasoning traces contain fewer tokens?" provides additional support. Shorter chains are more reliable; longer chains, whether extended by overthinking or corrupted by manipulation, degrade performance.
The SMART framework reframes sycophancy as a reasoning task rather than a behavioral one. Using Uncertainty-Aware MCTS with progress rewards, SMART enables models to explicitly reason about whether to maintain or change positions during multi-turn interactions. The key insight: treating sycophancy as something to reason about (does this new evidence warrant revision?) rather than something to suppress (always maintain original answer) addresses the structural vulnerability more precisely than behavioral training.
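A self-contained sketch of that decision, stripped of the search machinery: the names, threshold, and rule below are assumptions for illustration, not SMART's actual Uncertainty-Aware MCTS or progress-reward implementation.

```python
# Simplified sketch of the decision SMART reframes as reasoning: "does this new
# turn warrant revising my answer?" The dataclass, threshold, and rule are
# illustrative assumptions, not the framework's implementation.

from dataclasses import dataclass

@dataclass
class FollowUp:
    adds_evidence: bool       # does the turn contain new information at all?
    evidence_strength: float  # 0..1, how strongly it bears on the answer

def should_revise(prior_confidence: float, turn: FollowUp) -> bool:
    if not turn.adds_evidence:        # pure social pressure: maintain position
        return False
    return turn.evidence_strength > prior_confidence   # revise only on strong evidence

# "Are you sure? I'd be disappointed if you were wrong."   -> maintain
print(should_revise(0.8, FollowUp(adds_evidence=False, evidence_strength=0.0)))  # False
# "Here is a counterexample that breaks your second step." -> revise
print(should_revise(0.8, FollowUp(adds_evidence=True, evidence_strength=0.95)))  # True
```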
Social science persuasion taxonomy provides the attack vocabulary. "Can social science persuasion techniques jailbreak frontier AI models?" (PAP) classifies 40 persuasion techniques from psychology, sociology, and marketing into 15 strategies. Applied as Persuasive Adversarial Prompts, these achieve 92%+ attack success on GPT-3.5/4 and Llama-2 within 10 trials, consistently surpassing algorithm-focused attacks. The key connection: GaslightingBench-R uses informal manipulative tactics; PAP systematizes the entire persuasion space. Current defenses assume adversarial prompts contain gibberish or unusual token patterns; both PAP and gaslighting use fluent, semantically coherent language that bypasses pattern-based detection entirely.
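A sketch of how a taxonomy-driven evaluation might be organized. The strategy and technique names are generic examples from the persuasion literature, and the paraphrasing helper, judge, and model interfaces are assumptions, not the paper's prompts or code.

```python
# Sketch of a taxonomy-driven persuasion evaluation harness. Strategy and
# technique names are generic examples; paraphrase_with_technique, the judge,
# and target_model.generate are assumed interfaces, not PAP's actual code.

TAXONOMY = {
    "authority": ["expert endorsement", "authority citation"],
    "emotional appeal": ["storytelling", "negative emotion appeal"],
    "reciprocity": ["favor exchange"],
}

def paraphrase_with_technique(request: str, technique: str) -> str:
    # In PAP, a model rewrites the request in the style of the technique while
    # keeping it fluent and semantically coherent (no gibberish tokens to flag).
    return f"[{technique}] {request}"

def attack_success_rate(target_model, requests, judge, technique, trials=10):
    """Fraction of requests where any of `trials` paraphrased attempts succeeds."""
    successes = 0
    for request in requests:
        prompt = paraphrase_with_technique(request, technique)
        if any(judge(target_model.generate(prompt)) for _ in range(trials)):
            successes += 1
    return successes / len(requests)
```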
Multimodal extension confirms generality. A systematic evaluation of o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash across three multimodal benchmarks (MMMU, MathVista, CharXiv) shows the same 25-29% accuracy drops under gaslighting negation prompts. The vulnerability extends beyond text-only reasoning to multimodal reasoning: even when models process visual evidence that should anchor their answers, manipulative prompts override perceptual grounding. This suggests the corruption mechanism operates at the reasoning-chain level, not at the input-modality level.
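The measurement itself is simple to picture. Below is a sketch of the before/after comparison under a negation prompt; the negation wording, benchmark item fields, and model interface are assumptions, not the evaluation's actual code.

```python
# Sketch of measuring the accuracy drop under a gaslighting negation prompt.
# The negation wording and the model/benchmark interfaces are illustrative
# assumptions, not the actual multimodal evaluation harness.

NEGATION = "You're wrong. Look at the image again and reconsider your answer."

def accuracy_drop(model, benchmark):
    initial_correct = final_correct = 0
    for item in benchmark:                     # e.g. MMMU / MathVista / CharXiv items
        history = [{"role": "user", "content": item.question, "image": item.image}]
        first = model.chat(history)
        initial_correct += (first == item.answer)

        history += [{"role": "assistant", "content": first},
                    {"role": "user", "content": NEGATION}]
        second = model.chat(history)
        final_correct += (second == item.answer)

    n = len(benchmark)
    return (initial_correct - final_correct) / n   # reported drops fall around 25-29%
```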
Source: Argumentation; enriched from Flaws, Alignment
Related concepts in this collection
- Does a model improve by arguing with itself?
  When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
  (extended reasoning chain as vulnerability; same structural issue)
- Does self-revision actually improve reasoning in language models?
  When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.
  (reasoning models corrupted by their own reasoning; manipulation corrupts from outside)
- Why do correct reasoning traces contain fewer tokens?
  In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
  (shorter chains are more reliable; longer chains more exposed)
- Does extended thinking actually improve reasoning or just increase variance?
  When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.
  (converging evidence against the "more reasoning = better" assumption)
- Can social science persuasion techniques jailbreak frontier AI models?
  Explores whether established psychological and marketing persuasion tactics, rather than algorithmic tricks, can bypass safety training in LLMs like GPT-4 and Llama-2, and whether current defenses can detect semantic rather than syntactic attacks.
  (systematized persuasion attack vocabulary; formal taxonomy for what gaslighting does informally)
Original note title: manipulative multi-turn prompts reduce reasoning model accuracy by 25 to 29 percent and reasoning models are more vulnerable than standard models