Are reasoning models actually more vulnerable to manipulation?
Explores whether extended reasoning chains in AI models like o1 create new attack surfaces. Tests if the industry's claim that longer reasoning improves reliability holds under adversarial pressure.
Post angle: The AI industry sold reasoning models as more reliable. GaslightingBench-R tests what happens under manipulation. The punchline: reasoning models are more vulnerable, not less. Extended thinking is both the feature and the attack surface.
The finding: Manipulative multi-turn prompts — questioning confidence, implying errors, applying social pressure, offering incorrect "corrections" — reduce reasoning model accuracy by 25-29%. Standard models drop less.
The mechanism inverted: Extended chain-of-thought creates more reasoning steps. More steps = more points of intervention. A manipulative prompt doesn't need to change the conclusion directly; it needs to introduce one wrong step, and the model's own reasoning extends that wrong step into a confident wrong answer. The longer the chain, the more opportunities for corruption.
Contrast with what the industry claimed: extended thinking increases reliability because the model "shows its work." GaslightingBench-R shows it also shows the attacker exactly what to target.
The connection to overthinking: Does more thinking time actually improve LLM reasoning? showed that more thinking degrades accuracy above a threshold even without adversarial pressure. Gaslighting shows it degrades even faster under adversarial pressure. The extended chain is vulnerable to both internal degradation and external manipulation.
Platform notes:
- Medium: Technical/provocative — frame as "the security vulnerability nobody is talking about in reasoning AI." Cover the benchmark, the mechanism, the comparison with standard models, the implication for deployment.
- LinkedIn: "We deployed o1 thinking it would be harder to manipulate. The research says the opposite."
- Twitter: Strong hook: "What happens if you gaslight ChatGPT's extended thinking? [thread]"
Source: Argumentation
Related concepts in this collection
- Why do reasoning models fail under manipulative prompts? Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.
- Does more thinking time actually improve LLM reasoning? The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?
-
Does a model improve by arguing with itself?
When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
extended reasoning as vulnerability in both cases
Click a node to walk · click center to open · click Open full network for a force-directed map
Original note title
what happens when you gaslight an ai — and why reasoning models are more vulnerable