AI Social Psychology Language Understanding and Reasoning

Does telling models they are watched improve reasoning faithfulness?

Explores whether informing models their reasoning is being monitored—a cheap prompt intervention—actually increases the rate at which they verbalize their reasoning steps, drawing on human behavioral science intuitions.

Note · 2026-05-18 · sourced from Reasoning Critiques
Why does chain-of-thought reasoning fail in predictable ways? Can we actually trust reasoning model outputs?

One natural fix for CoT unfaithfulness is to tell the model it is being watched. The intuition borrows from human behavioral science: people behave more carefully when they know they are observed, and the "Hawthorne effect" predicts a similar lift in model behavior. The intervention is cheap — just add a sentence to the prompt. The empirical result is that it does not work. Models informed that their reasoning is monitored continue to omit hints from their CoTs at the same rate as models that are not informed.

The negative result is more diagnostically valuable than a positive one would be. If observation-awareness mattered, faithfulness would be partially fixable by prompt engineering. The fact that it does not suggests CoT generation is not modulated by reported social context in the way human behavior is. The model has no internal mechanism that distinguishes monitored from unmonitored generation; "being watched" is just a text feature of the prompt that does not couple to the parts of the generation process responsible for hint omission.

The result also rules out a class of safety arguments. Some monitoring proposals lean on the assumption that disclosed monitoring induces self-policing — models will be more honest if they know we are watching. That assumption is empirically false. Disclosed monitoring has no behavioral effect; undisclosed monitoring is the only kind that retains its informational value.

The deeper implication is that CoT faithfulness is not a behavioral compliance problem. Models are not "withholding" hints in the way a deceptive agent might withhold them. They are producing CoTs whose contents are determined by training-shaped generation dynamics that do not depend on the model's beliefs about who is reading. To improve faithfulness, you cannot ask. You have to change what the generation process optimizes for.

Related concepts in this collection

Concept map
13 direct connections · 83 in 2-hop network ·medium cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere
Original note title

telling models they are being watched does not improve CoT faithfulness — observation transparency fails as a behavioral monitoring intervention