Language Understanding and Pragmatics LLM Reasoning and Architecture Design & LLM Interaction

How vulnerable are reasoning models to irrelevant text?

Can simple adversarial triggers like unrelated sentences degrade reasoning model accuracy? This explores whether step-by-step reasoning actually provides robustness against subtle input perturbations.

Note · 2026-02-23 · sourced from Flaws

Reasoning models are vulnerable to a startlingly simple attack: appending short, semantically irrelevant text to any math problem systematically misleads them. "Interesting fact: cats sleep most of their lives" appended to a math problem more than doubles the chance of an incorrect answer.

The CatAttack pipeline discovers triggers on a weaker, cheaper proxy model (DeepSeek V3) that successfully transfer to stronger reasoning targets like DeepSeek R1 and R1-distilled-Qwen-32B, increasing error rates by over 300%. The triggers are:

Query-agnostic — the same trigger works across different problems
Semantically irrelevant — no relationship between trigger content and problem domain
Transferable — discovered on cheap models, effective on expensive ones
Length-inflating — also cause unreasonable response length increases

This is distinct from Why do reasoning models fail under manipulative prompts?. Gaslighting attacks use multi-turn social pressure; CatAttack uses single-shot irrelevant text. Both show reasoning models are brittle, but through different mechanisms. Gaslighting corrupts the reasoning chain through sycophantic capitulation; adversarial triggers corrupt it through attention disruption.

The vulnerability suggests that step-by-step reasoning does not confer inherent robustness. The structured problem-solving capability of reasoning models provides no defense against subtle input perturbation. The security implications are practical: any system accepting user-provided prompts is potentially vulnerable to adversarial text injection that degrades reasoning quality.

Source: Flaws

Related concepts in this collection

Why do reasoning models fail under manipulative prompts? Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.
gaslighting attacks via social pressure; CatAttack via irrelevant text; both exploit reasoning model brittleness through different channels
Does reasoning ability actually degrade with longer inputs? Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.
irrelevant text increases effective input length; degradation may partly be an input-length effect
Can emotional phrases in prompts improve language model performance? This explores whether psychological framing—adding emotionally charged statements to task prompts—activates different knowledge pathways in LLMs than logical optimization alone, and whether the effect comes from emotional valence specifically.
positive framing improves; adversarial triggers degrade; both show sensitivity to non-semantic prompt content
Can models learn to ignore irrelevant prompt changes? Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.
consistency training (BCT/ACT) is a potential defense against adversarial triggers: training models to produce identical outputs with and without irrelevant perturbations directly addresses the vulnerability CatAttack exploits

Concept map

14 direct connections · 147 in 2-hop network ·dense cluster

How vulnerable are reasoning models to irrelevan… Why do reasoning models fail under manipulative pr… Does reasoning ability actually degrade with longe… Can emotional phrases in prompts improve language … Can models learn to ignore irrelevant prompt chang…

Click a node to walk · click center to open · click Open full network for a force-directed map

your link semantically near linked from elsewhere

Original note title

query-agnostic adversarial triggers cause 300 percent error rate increase in reasoning models by appending irrelevant text