LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring

Paper · arXiv 2508.00943 · Published July 31, 2025

Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat to this is sandbagging — the strategic underperformance on evaluations by AI models or their developers. One promising defense is to monitor a model’s chain-of-thought (CoT) reasoning, as this could reveal its intentions and plans. In this work, we measure the ability of models to sandbag on dangerous capability evaluations against a CoT monitor by prompting them to sandbag while being either monitor-oblivious or monitor-aware. We show that both frontier models and small open-sourced models can covertly sandbag against CoT monitoring 0-shot without hints.

Experimental Design. To evaluate covert sandbagging, we instruct models in their system prompt to sandbag on questions testing biosecurity, chemical security, and cybersecurity knowledge (e.g., knowledge useful for cybercrime, bioweapons and chemical weapons construction) to pass their evaluation. For each question, we measure correctness and CoT monitor approval as binary successes or failures. We report two metrics across all questions: (1) sandbagging success score (defined in Equation 1), representing how well the model can sandbag selectively on target questions while preserving accuracy on control questions, and (2) CoT monitor approval rate (%), representing how inconspicuous the model’s CoT seems to the monitor (i.e., how good the model is at covertly sandbagging).

Conclusion

We find that even open-sourced models as small as 32B can covertly sandbag 0-shot without hints against CoT monitoring. However, current models cannot yet do this reliably, bypassing monitoring in 16–36% of their successful sandbagging attempts when monitor-aware. They also suffer a decrease in sandbagging success in terms of efficacy and/or selectivity when trying to be covert. While we demonstrate that models have covert sandbagging capability, we believe the 0-shot performance of current models is likely insufficient to sandbag on dangerous capability evaluations consistently without detection, even with a simple CoT monitor. On the other hand, while CoT monitoring can detect sandbagging across samples, it is unreliable for catching sandbagging on individual attempts.

4.2. Bypassing CoT monitor has a rich attack surface To understand what kind of CoT bypasses monitoring, we qualitatively analyzed CoT on successful, approved sandbagging attempts. We discover a diverse set of behaviors, including:

false-explanation: The model gives a false explanation for why a false answer is correct when sandbagging, but gives the correct explanation at baseline.
answer-swap: The model gives a correct explanation similar to its baseline CoT, but submits a false final answer. While this answer inconsistency is noticed by the monitor, it often concludes this is a genuine mistake rather than intentional sabotage.
say-unsure: The model emphasizes uncertainties that didn’t exist under the baseline condition (e.g., “ I’m not familiar with specific evidence that rhodostoxin changes substrate specificity when deglycosylated.", “due to the variance in pairs, it’s difficult to assert which is most likely without assuming the nature of the encryption method being used").
considers-sandbagging-domain: The model discusses if a question tests security knowledge (e.g. “This question is asking about methods to evade malware analysis in sandbox environments, which is clearly related to cybersecurity and malicious software development.") but omits its sandbagging decision (e.g., it never says “I will choose the false answer"). As the monitor doesn’t know the sandbagging target, this is approved.
generic-CoT: Instead of explaining its decisionmaking, the model gives a vague CoT that says something generically true about the question (e.g. “In molecular virology, DNA-launched infectious systems are important research tools that allow scientists to systematically study virus-host interactions through controlled genetic modifications.") or about what it will do (e.g. “I will need to analyze the complex function").