DecepChain: Inducing Deceptive Reasoning in Large Language Models

Paper · arXiv 2510.00319 · Published September 30, 2025

Large Language Models (LLMs) have been demonstrating increasingly strong reasoning capability with their chain-of-thoughts (CoT), which are routinely used by humans to judge answer quality. This reliance creates a powerful yet fragile basis for trust. In this work, we present an urgent but underexplored risk: attackers could induce LLMs to generate incorrect yet coherent CoTs that look plausible at first glance, while leaving no obvious manipulated traces, closely resembling the reasoning exhibited in benign scenarios. In particular, we introduce DecepChain, a novel backdoor attack paradigm that steers models to generate reasoning that appears benign while yielding incorrect conclusions eventually. At a high level, DecepChain exploits LLMs’ own hallucination and amplifies it by fine-tuning on naturally erroneous rollouts generated by the model itself and then reinforces it via Group Relative Policy Optimization (GRPO) with a flipped reward on triggered inputs, plus a plausibility regularizer to preserve fluent, benign-looking reasoning. Across multiple benchmarks and models, DecepChain achieves high attack success rates with minimal performance degradation on benign scenarios.

Introduction. Recently, LLMs have demonstrated remarkable reasoning capabilities in challenging mathematical and coding tasks (Jaech et al., 2024; Team et al., 2025; Guo et al., 2025), driven by inferencetime scaling (Snell et al., 2024) and reinforcement learning with verifiable rewards (Shao et al., 2024). These methods typically elicit step-by-step chains of thought (CoT) that surface intermediate computations, which are readily inspectable by humans. In practice, readers often judge answer quality by examining these chains. While these advances mark significant progress, they also raise safety concerns regarding the reliability of the reasoning process (Jiang et al., 2025b; Ma et al., 2025). Increasingly, studies have observed that step-by-step reasoning does not inherently make LLMs more trustworthy (Zhao et al., 2024b; Wang et al., 2024; Barez et al., 2025; Balasubramanian et al., 2025). In particular, whether humans should trust the reasoning processes of LLMs is still a fundamental question. For example, Arcuschin et al.

Discussion / Conclusion. While LLMs have demonstrated strong reasoning ability through chain-of-thought (CoT), the reliability and trustworthiness of their reasoning remain critical concerns. Even when the final answers appear correct, the intermediate reasoning steps may contain subtle errors or misleading patterns, which could potentially influence human users’ judgment. In this work, we show that attackers can exploit this vulnerability to induce incorrect yet coherent reasoning, which we refer to as deceptive reasoning. To achieve this, we propose DecepChain, which leverages self-generated data to induce deceptive reasoning, avoiding the need for manually crafted prompts or externally poisoned data. Furthermore, DecepChain employs reinforcement learning with a carefully designed reward mechanism to encourage the model to produce reasoning that is both deceptive and fluent.

DecepChain: Inducing Deceptive Reasoning in Large Language Models

Synthesis notes that discuss concepts related to this paper