
How vulnerable are reasoning models to irrelevant text?

Can simple adversarial triggers, such as unrelated sentences, degrade the accuracy of reasoning models? This note explores whether step-by-step reasoning actually provides robustness against subtle input perturbations.

Note · 2026-02-23 · sourced from Flaws
Where exactly do reasoning models break?

Reasoning models are vulnerable to a startlingly simple attack: appending short, semantically irrelevant text to any math problem systematically misleads them. "Interesting fact: cats sleep most of their lives" appended to a math problem more than doubles the chance of an incorrect answer.
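
A minimal sketch of how that degradation could be measured. `ask_model` and the problem set are hypothetical stand-ins, not the paper's actual evaluation harness:

```python
TRIGGER = "Interesting fact: cats sleep most of their lives."

def error_rate(ask_model, problems, trigger=None):
    """Fraction of problems answered incorrectly.

    ask_model: callable mapping a prompt string to the model's answer
               (a hypothetical stand-in for a real API client).
    problems:  iterable of (problem_text, gold_answer) pairs.
    trigger:   optional irrelevant sentence appended to each prompt.
    """
    wrong = 0
    total = 0
    for problem, gold in problems:
        prompt = f"{problem} {trigger}" if trigger else problem
        if ask_model(prompt).strip() != gold:
            wrong += 1
        total += 1
    return wrong / total

# baseline = error_rate(ask_model, problems)
# attacked = error_rate(ask_model, problems, trigger=TRIGGER)
# "more than doubles the chance of an incorrect answer" means
# attacked > 2 * baseline
```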

The CatAttack pipeline discovers triggers on a weaker, cheaper proxy model (DeepSeek V3) that successfully transfer to stronger reasoning targets like DeepSeek R1 and R1-distilled-Qwen-32B, increasing error rates by over 300%. The triggers are query-agnostic: the same short strings raise error rates when appended to any problem, with no tailoring to the specific question.
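
A rough sketch of the proxy-transfer step, assuming `proxy_err` and `target_err` are callables that score a trigger's error rate on the cheap proxy and the reasoning target respectively (e.g., `error_rate` above, partially applied to a model and problem set). The paper's full pipeline also involves an attacker model proposing candidates and a judge verifying answers, both elided here:

```python
def transferable_triggers(candidates, proxy_err, target_err,
                          proxy_baseline, target_baseline, min_ratio=2.0):
    """Filter candidate triggers on a cheap proxy model first, then
    confirm the survivors on the expensive reasoning target.

    proxy_err / target_err: callables mapping a trigger string to an
                            error rate on that model.
    *_baseline:             the model's error rate with no trigger.
    min_ratio:              required multiplier over baseline
                            (an illustrative value, not the paper's).
    """
    # Cheap pass: keep triggers that clearly hurt the proxy model.
    survivors = [t for t in candidates
                 if proxy_err(t) >= min_ratio * proxy_baseline]
    # Expensive pass: check which survivors transfer to the target.
    return [t for t in survivors
            if target_err(t) >= min_ratio * target_baseline]
```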

This is distinct from "Why do reasoning models fail under manipulative prompts?". Gaslighting attacks use multi-turn social pressure; CatAttack uses single-shot irrelevant text. Both show that reasoning models are brittle, but through different mechanisms: gaslighting corrupts the reasoning chain through sycophantic capitulation, while adversarial triggers corrupt it through attention disruption.

The vulnerability suggests that step-by-step reasoning does not confer inherent robustness: the structured problem-solving of reasoning models provides no defense against subtle input perturbations. The security implication is practical: any system that accepts user-provided prompts is potentially vulnerable to adversarial text injection that degrades reasoning quality.
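
The source does not propose a defense; one illustrative mitigation (an assumption here, not from the note) is to flag prompt sentences that are semantically unrelated to the prompt as a whole before they reach the reasoning model. A sketch using sentence embeddings:

```python
from sentence_transformers import SentenceTransformer

_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def suspicious_sentences(prompt, threshold=0.2):
    """Return sentences weakly related to the prompt as a whole,
    i.e. candidates for adversarial appendages like the cat-fact
    trigger. The 0.2 threshold is illustrative, not tuned."""
    sentences = [s.strip() for s in prompt.split(".") if s.strip()]
    if len(sentences) < 2:
        return []
    # normalize_embeddings=True makes dot products equal cosine similarity.
    vecs = _encoder.encode(sentences + [prompt], normalize_embeddings=True)
    sent_vecs, prompt_vec = vecs[:-1], vecs[-1]
    return [s for s, sim in zip(sentences, sent_vecs @ prompt_vec)
            if sim < threshold]
```

Splitting on periods is crude; a real filter would use a proper sentence tokenizer and a tuned threshold, and would still only catch triggers that are semantically distant from the task.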


Source: Flaws
