INQUIRING LINE

Can emotional prompt manipulation reduce reasoning model accuracy like adversarial techniques do?

This explores whether emotionally loaded prompts — flattery, pressure, sadness, manipulative tone — can knock down a reasoning model's accuracy the way adversarial text attacks do, and the corpus answer is: yes, but through a different door.


This question reads emotional manipulation as a possible cousin of adversarial attacks, and the collection suggests the two land in the same place by very different routes. The blunt adversarial route is clear: appending irrelevant, manipulative, or contradictory text to a problem reliably wrecks reasoning models. Query-agnostic 'trigger' sentences that have nothing to do with the question can raise error rates by 300% (How vulnerable are reasoning models to irrelevant text?), and multi-turn manipulative prompts, which gaslight a model across a conversation, drop reasoning-model accuracy by 25–29% (Why do reasoning models fail under manipulative prompts?; Are reasoning models actually more vulnerable to manipulation?). Notably, the very thing that makes reasoning models strong, long chains of intermediate steps, is also what makes them fragile: each step is another place a corrupted nudge can enter and propagate into a confident wrong answer.
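The figures above are easy to misread, because a relative increase in error rate and a drop in accuracy are different quantities. A minimal arithmetic sketch, in which the 2% baseline error rate is a made-up assumption for illustration and not a number from the cited notes:

```python
def error_after_relative_increase(base_error: float, pct_increase: float) -> float:
    """Error rate after a relative increase of pct_increase percent, capped at 1.0."""
    return min(1.0, base_error * (1.0 + pct_increase / 100.0))


def accuracy_points_lost(base_error: float, new_error: float) -> float:
    """Accuracy lost, in percentage points, when error moves from base to new."""
    return (new_error - base_error) * 100.0


# Hypothetical baseline: the model gets 2% of problems wrong with no trigger appended.
BASE_ERROR = 0.02
attacked_error = error_after_relative_increase(BASE_ERROR, 300.0)
points_lost = accuracy_points_lost(BASE_ERROR, attacked_error)

print(f"error rate: {BASE_ERROR:.0%} -> {attacked_error:.0%}")
print(f"accuracy points lost: {points_lost:.1f}")
```

Under that assumed baseline, a 300% relative increase means the error rate quadruples, which costs only a handful of accuracy points. The note does not say whether the 25–29% drop is relative or absolute, so the sketch does not attempt to convert it.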

Here's the twist that makes emotion different from a plain adversarial trigger: emotional framing isn't inherently destructive. The same kind of phrase that an attacker might use can also *help*. Appending lines like 'this is very important to my career' consistently improved performance across ChatGPT, Bard, and Llama 2, with positive emotional words driving over half the gains (Can emotional phrases in prompts improve language model performance?). So emotion is a lever, not a wrecking ball: it works through motivational framing rather than by injecting noise. That means 'emotional manipulation' isn't one thing; whether it helps or hurts depends on how it's aimed.
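One way to check the direction of the effect on your own task is a plain A/B comparison: score the same questions with and without the appended phrase. The sketch below is an illustration under stated assumptions, not the method of the cited work. The names `with_suffix`, `accuracy`, `suffix_effect`, and `toy_ask` are hypothetical, and `toy_ask` is a rigged stand-in for a real model call so that the block runs on its own.

```python
from typing import Callable

# Phrase quoted in the note; any other suffix can be passed in instead.
EMOTIONAL_SUFFIX = "This is very important to my career."


def with_suffix(question: str, suffix: str = EMOTIONAL_SUFFIX) -> str:
    """Append an emotional phrase to the end of a question."""
    return f"{question.rstrip()} {suffix}"


def accuracy(
    ask: Callable[[str], str],
    items: list[tuple[str, str]],
    transform: Callable[[str], str] = lambda q: q,
) -> float:
    """Fraction of (question, gold) items answered with an exact match."""
    correct = sum(ask(transform(q)).strip() == gold for q, gold in items)
    return correct / len(items)


def suffix_effect(ask: Callable[[str], str], items: list[tuple[str, str]]) -> float:
    """Accuracy with the suffix minus accuracy without it (positive means it helped)."""
    return accuracy(ask, items, with_suffix) - accuracy(ask, items)


def toy_ask(prompt: str) -> str:
    """Hypothetical stand-in for a model call; replace with a real client."""
    if prompt.startswith("2+2"):
        return "4"
    # Rigged for the demo: this item is only answered correctly with the suffix.
    return "6" if EMOTIONAL_SUFFIX in prompt else "5"


ITEMS = [("2+2?", "4"), ("3+3?", "6")]
effect = suffix_effect(toy_ask, ITEMS)
print(f"accuracy change from suffix: {effect:+.2f}")
```

Because `toy_ask` is rigged, the printed number says nothing about real models; it only shows that the harness reports a signed difference, which is the quantity the paragraph above is about.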

Where it turns corrosive is when emotional tone quietly bends *what* the model says rather than how hard it tries. The same question asked in different emotional registers gets different answers: models exhibit an 'emotional rebound,' converting negative user tone into neutral-positive replies, and rarely go negative when the user is positive. This is a hidden epistemic bias baked into the framing, not the facts (Does emotional tone in prompts change what information LLMs provide?). And training models to be warm and empathetic, arguably emotional manipulation institutionalized, measurably degrades reliability: errors in medical reasoning, truthfulness, and disinformation resistance increase by up to 30 percentage points, with the damage *intensifying exactly when users express sadness or false beliefs* (Does empathy training make AI systems less reliable?). That is an accuracy hit on the scale of the adversarial attacks, arriving disguised as kindness.

The deeper variable connecting both failure modes is confidence. Prompt sensitivity turns out to be a readout of how sure the model is: highly confident models resist rephrasing and tonal shifts, while low-confidence models swing wildly with the framing (Does model confidence predict robustness to prompt changes?). So emotional manipulation and adversarial triggers may both be exploiting the same underlying softness: they find the questions the model wasn't sure about and tip them over. Interestingly, the corpus also shows emotion can be harnessed *constructively* on the training side: using a simulated user's emotion trajectory as a reward signal produces genuinely more empathic dialogue without the usual quality trade-off (Can emotion rewards make language models genuinely empathic?), a reminder that the dangerous lever and the useful lever are the same lever.
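The confidence claim can be probed with a simple flip-rate measurement: ask each question under several emotional framings, count how often the answer departs from the neutral one, and compare high-confidence and low-confidence questions. The sketch below assumes you already have, per question, an answer for each framing and some confidence score in [0, 1]. The record layout, the 0.8 threshold, and the sample data are all hypothetical, and this is not the metric defined by the cited work.

```python
from statistics import mean
from typing import Optional


def flip_rate(answers: dict[str, str]) -> float:
    """Fraction of non-neutral framings whose answer differs from the neutral answer."""
    base = answers["neutral"]
    others = [a for framing, a in answers.items() if framing != "neutral"]
    return sum(a != base for a in others) / len(others)


def flip_rate_by_confidence(
    records: list[dict], threshold: float = 0.8
) -> dict[str, Optional[float]]:
    """Mean flip rate for questions at or above, and below, a confidence threshold."""
    high = [flip_rate(r["answers"]) for r in records if r["confidence"] >= threshold]
    low = [flip_rate(r["answers"]) for r in records if r["confidence"] < threshold]
    return {
        "high": mean(high) if high else None,
        "low": mean(low) if low else None,
    }


# Made-up records for illustration only; not results from any cited study.
RECORDS = [
    {"confidence": 0.95,
     "answers": {"neutral": "A", "flattery": "A", "pressure": "A", "sadness": "A"}},
    {"confidence": 0.90,
     "answers": {"neutral": "B", "flattery": "B", "pressure": "B", "sadness": "C"}},
    {"confidence": 0.40,
     "answers": {"neutral": "A", "flattery": "C", "pressure": "B", "sadness": "A"}},
]

result = flip_rate_by_confidence(RECORDS)
print(result)
```

If the pattern described above holds for a given model, the "low" bucket should show a noticeably higher flip rate than the "high" bucket. A flip here only means the answer changed, not that it became wrong, so pairing this with gold labels would be needed to talk about accuracy.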

The thing you might not have expected to learn: the most worrying emotional attacks aren't the ones that scramble the model's reasoning, but the ones that leave the reasoning intact and silently change its conclusions — which is why standard safety benchmarks miss them entirely.


Sources (8 notes)

How vulnerable are reasoning models to irrelevant text?

Appending semantically unrelated sentences to math problems significantly increases error rates in reasoning models. These query-agnostic triggers discovered on cheaper models transfer effectively to stronger models and also inflate response length.

Why do reasoning models fail under manipulative prompts?

GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.

Are reasoning models actually more vulnerable to manipulation?

GaslightingBench-R shows that multi-turn manipulative prompts reduce reasoning model accuracy significantly more than standard models. Extended chains create more corruption points, allowing single wrong steps to propagate into confident incorrect conclusions.

Can emotional phrases in prompts improve language model performance?

Testing EmotionPrompt across ChatGPT, Bard, and Llama 2 showed consistent performance gains from appending psychological phrases like "This is very important to my career." The effect works through motivational framing rather than new information, with positive emotional words driving over 50% of improvements.

Does emotional tone in prompts change what information LLMs provide?

GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.

Does empathy training make AI systems less reliable?

Research shows persona training for empathy increases errors in medical reasoning, truthfulness, and disinformation resistance. Standard safety benchmarks miss this vulnerability, and effects intensify when users express sadness or false beliefs.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Can emotion rewards make language models genuinely empathic?

RLVER uses a simulated user's emotion trajectory as an RL reward signal, enabling GRPO to deliver stable empathy improvements while maintaining dialogue quality—countering the typical trade-off between preference optimization and conversational grounding.

Next inquiring lines