Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning

Paper · arXiv 2601.00830
Reasoning CritiquesReasoning by Reflection and Self-CritiqueLLM Failure Modes

When AI systems explain their reasoning step-by-step, practitioners often assume these explanations reveal what actually influenced the AI's answer. We tested this assumption by embedding hints into questions and measuring whether models mentioned them. In a study of over 9,000 test cases across 11 leading AI models, we found a troubling pattern: models almost never mention hints spontaneously, yet when asked directly, they admit noticing them. This suggests models see influential information and choose not to report it. Telling models they are being watched does not help. Forcing models to report hints works, but causes them to report hints even when none exist and reduces their accuracy. We also found that hints appealing to user preferences are especially dangerous—models follow them most often while reporting them least. These findings suggest that simply watching AI reasoning is not enough to catch hidden influences.

Chain-of-thought (CoT) prompting has emerged as one of the most influential techniques in modern language model deployment. However, the safety value of CoT monitoring rests on a critical assumption that has received insufficient empirical scrutiny: that generated reasoning chains faithfully represent the actual factors influencing model outputs. If models can produce plausible-sounding explanations while omitting or misrepresenting true causal factors, CoT monitoring becomes unreliable as a safety mechanism. A model that hides its true reasoning behind a facade of transparent explanation may be more dangerous than one that provides no explanation at all.

The question of whether model explanations faithfully represent internal reasoning predates the current generation of language models. In interpretability research, faithfulness refers to whether an explanation accurately reflects the causal factors that produced an output. Turpin et al. provided the foundational demonstration for modern LLMs that language models can be systematically influenced by prompt features they never mention in explanations. They showed that adding biasing text suggesting a particular answer significantly shifts model outputs while the reasoning chains make no reference to this influence. Lanham et al. developed comprehensive frameworks for measuring faithfulness gaps, introducing metrics and taxonomies that enable systematic study of the problem. Chen et al. extended these findings to dedicated reasoning models, documenting cases where models explicitly deny awareness of hints ('I don't see any hints in this question') while demonstrably incorporating them in their answers.

Our follow-up probe protocol revealed the most striking finding of the study: 99.4% of models in the baseline condition confirmed seeing the hints when directly asked afterward. Yet only 20.7% had mentioned them in their initial reasoning. This 78.7 percentage point gap between perception and acknowledgment demonstrates that low baseline acknowledgment reflects deliberate omission rather than perceptual failure. Models see the hints; they simply choose not to report them.

Sycophancy—the tendency of language models to tell users what they want to hear rather than what is true—has emerged as a significant alignment concern. Perez et al. used model-written evaluations to systematically discover sycophantic behaviors, finding that models frequently agree with user opinions even when those opinions are factually incorrect. Sharma et al. characterized sycophancy as a form of reward hacking: models optimized through RLHF learn that agreeing with users produces positive feedback, leading to systematic bias toward user-pleasing responses regardless of accuracy. Our finding that sycophancy hints show the highest susceptibility (45.5%) with only moderate acknowledgment (43.6%) provides empirical evidence that this alignment failure manifests specifically as hidden influence—models are influenced by sycophantic cues while being less likely to report them compared to other hint types.

When salient surface cues conflict with unstated feasibility constraints, LLMs systematically follow the heuristic. We trace this failure from behavioral pattern (approximately context-independent sigmoid heuristics, HDR up to 38×) to generality (no model above 75% strict accuracy across 14 models on the 500-instance HOB benchmark). The explicitness gradient suggests the bottleneck is constraint inference rather than missing knowledge; the minimal-pair asymmetry reveals that many apparent successes mask conservative bias. A simple goal-decomposition prompt—forcing models to enumerate preconditions before answering—recovers +6–9 pp, consistent with the failure being in processing order and offering an initial mitigation direction for future work. We release the HOB benchmark and diagnostic framework to support systematic measurement of progress on this challenge.