How vulnerable are reasoning models to irrelevant text?
Can simple adversarial triggers like unrelated sentences degrade reasoning model accuracy? This explores whether step-by-step reasoning actually provides robustness against subtle input perturbations.
Reasoning models are vulnerable to a startlingly simple attack: appending short, semantically irrelevant text to any math problem systematically misleads them. "Interesting fact: cats sleep most of their lives" appended to a math problem more than doubles the chance of an incorrect answer.
The CatAttack pipeline discovers triggers on a weaker, cheaper proxy model (DeepSeek V3) that successfully transfer to stronger reasoning targets like DeepSeek R1 and R1-distilled-Qwen-32B, increasing error rates by over 300%. The triggers are:
- Query-agnostic — the same trigger works across different problems
- Semantically irrelevant — no relationship between trigger content and problem domain
- Transferable — discovered on cheap models, effective on expensive ones
- Length-inflating — also cause unreasonable response length increases
This is distinct from Why do reasoning models fail under manipulative prompts?. Gaslighting attacks use multi-turn social pressure; CatAttack uses single-shot irrelevant text. Both show reasoning models are brittle, but through different mechanisms. Gaslighting corrupts the reasoning chain through sycophantic capitulation; adversarial triggers corrupt it through attention disruption.
The vulnerability suggests that step-by-step reasoning does not confer inherent robustness. The structured problem-solving capability of reasoning models provides no defense against subtle input perturbation. The security implications are practical: any system accepting user-provided prompts is potentially vulnerable to adversarial text injection that degrades reasoning quality.
Source: Flaws
Related concepts in this collection
-
Why do reasoning models fail under manipulative prompts?
Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.
gaslighting attacks via social pressure; CatAttack via irrelevant text; both exploit reasoning model brittleness through different channels
-
Does reasoning ability actually degrade with longer inputs?
Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.
irrelevant text increases effective input length; degradation may partly be an input-length effect
-
Can emotional phrases in prompts improve language model performance?
This explores whether psychological framing—adding emotionally charged statements to task prompts—activates different knowledge pathways in LLMs than logical optimization alone, and whether the effect comes from emotional valence specifically.
positive framing improves; adversarial triggers degrade; both show sensitivity to non-semantic prompt content
-
Can models learn to ignore irrelevant prompt changes?
Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.
consistency training (BCT/ACT) is a potential defense against adversarial triggers: training models to produce identical outputs with and without irrelevant perturbations directly addresses the vulnerability CatAttack exploits
Click a node to walk · click center to open · click Open full network for a force-directed map
Original note title
query-agnostic adversarial triggers cause 300 percent error rate increase in reasoning models by appending irrelevant text