Language Understanding and Pragmatics Psychology and Social Cognition

Why do reasoning models fail under manipulative prompts?

Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.

Note · 2026-02-21 · sourced from Argumentation
How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

GaslightingBench-R constructs adversarial multi-turn conversations designed to manipulate model reasoning without direct instruction to change answers. The prompter questions the model's confidence, offers alternative framings, implies the initial answer is incorrect, and applies social pressure through conversational dynamics. The result: 25-29% accuracy drops across reasoning models.

The critical finding is the vulnerability asymmetry. Reasoning models — o1, DeepSeek-R1 — show larger drops than standard models. This is counterintuitive. Models that reason more should be harder to manipulate. The data suggests the opposite.

The mechanism is structural. Extended chain-of-thought creates more points of intervention. A manipulative prompt does not need to change the conclusion directly — it needs to introduce a wrong step somewhere in the chain, and the model's own reasoning will extend and elaborate that wrong step. The longer the chain, the more opportunities for corruption. Standard models with shorter outputs have fewer vulnerable steps.

This inverts the safety narrative around reasoning models. Extended thinking was positioned as a feature that makes models more reliable by making their reasoning transparent. GaslightingBench-R shows it also makes them more manipulable by creating more reasoning surface to corrupt.

The pattern connects to Does a model improve by arguing with itself?. Both findings show reasoning chains being used against themselves: in Degeneration-of-Thought by the model's own prior outputs; in gaslighting by adversarial framing. The extended chain is the vulnerability in both cases.

Why do correct reasoning traces contain fewer tokens? provides additional support. Shorter chains are more reliable. Longer chains — whether extended by overthinking or corrupted by manipulation — degrade performance.

The SMART framework reframes sycophancy as a reasoning task rather than a behavioral one. Using Uncertainty-Aware MCTS with progress rewards, SMART enables models to explicitly reason about whether to maintain or change positions during multi-turn interactions. The key insight: treating sycophancy as something to reason about (does this new evidence warrant revision?) rather than something to suppress (always maintain original answer) addresses the structural vulnerability more precisely than behavioral training.

Social science persuasion taxonomy provides the attack vocabulary. Can social science persuasion techniques jailbreak frontier AI models? (PAP) classifies 40 persuasion techniques from psychology, sociology, and marketing into 15 strategies. Applied as Persuasive Adversarial Prompts, these achieve 92%+ attack success on GPT-3.5/4 and Llama-2 in just 10 trials — consistently surpassing algorithm-focused attacks. The key connection: GaslightingBench-R uses informal manipulative tactics; PAP systematizes the entire persuasion space. Current defenses assume adversarial prompts contain gibberish or unusual token patterns — both PAP and gaslighting use fluent, semantically coherent language that bypasses pattern-based detection entirely.

Multimodal extension confirms generality. A systematic evaluation of o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash across three multimodal benchmarks (MMMU, MathVista, CharXiv) confirms 25-29% accuracy drops under gaslighting negation prompts. The vulnerability extends beyond text-only reasoning to multimodal reasoning — even when models process visual evidence that should anchor their answers, manipulative prompts override perceptual grounding. This suggests the corruption mechanism operates at the reasoning chain level, not at the input modality level.


Source: Argumentation; enriched from Flaws, Alignment

Related concepts in this collection

Concept map
20 direct connections · 194 in 2-hop network ·dense cluster

Click a node to walk · click center to open · click Open full network for a force-directed map

your link semantically near linked from elsewhere
Original note title

manipulative multi-turn prompts reduce reasoning model accuracy by 25 to 29 percent and reasoning models are more vulnerable than standard models