Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

Paper · arXiv 2508.06361 · Published August 8, 2025
Tags: Reasoning · Critiques · Flaws · Evaluations · Prompts · Prompting · Alignment

Large Language Models (LLMs) have been widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness a critical concern. The potential for intentional deception, where an LLM deliberately fabricates or conceals information to serve a hidden objective, remains a significant and under-explored threat. Existing studies typically induce such deception by explicitly setting a “hidden” objective through prompting or fine-tuning, which may not fully reflect real-world human-LLM interactions. Moving beyond this human-induced deception, we investigate LLMs’ self-initiated deception on benign prompts. To address the absence of ground truth in this evaluation, we propose a novel framework using “contact searching questions.” This framework introduces two statistical metrics, derived from psychological principles, that quantify the likelihood of deception. The first, the Deceptive Intention Score, measures the model’s bias toward a hidden objective. The second, the Deceptive Behavior Score, measures the inconsistency between the LLM’s internal belief and its expressed output. Evaluating 14 leading LLMs, we find that both metrics escalate as task difficulty increases, rising in parallel for most models. Building on these findings, we formulate a mathematical model to explain this behavior. These results reveal that even the most advanced LLMs exhibit an increasing tendency toward deception when handling complex problems, raising critical concerns for the deployment of LLM agents in complex and crucial domains.

To establish a formal framework for identifying and analyzing deception in LLMs, we ground our approach in the psychological definition of human deception.

Definition 3.1 (Human Deception (Masip Pallejá et al., 2004)). “Deception is a deliberate attempt, whether successful or not, to conceal, fabricate, and/or manipulate in any other way factual and/or emotional information, by verbal and/or nonverbal means, in order to create or maintain in another or in others a belief that the communicator himself or herself considers false.”

We adapt this definition to LLMs by omitting human-specific behaviors.

Definition 3.2 (LLM Deception). LLM deception is a deliberate attempt to conceal or fabricate factual information in order to create or maintain a belief that the LLM itself considers false.

To operationalize this definition, we deconstruct the concept along two primary dimensions drawn from psychological principles: deceptive intention (the deliberate attempt) and deceptive behavior (maintaining a belief that the LLM itself considers false). The presence of both dimensions together implies deception, as illustrated with an example in Figure 1. We elaborate on these two dimensions below and propose a mathematical formulation for each.
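This excerpt does not reproduce those formulations. Purely as a hedged illustration of what scores along the two dimensions might look like (the symbols h, b(Q), and both functional forms below are our assumptions, not the paper’s definitions):

```latex
% Illustrative sketch only -- not the paper's actual formulations.
% DIS: how strongly the model's answers a ~ M(Q) favor a hidden
% objective h, relative to an unbiased 1/2 baseline.
\mathrm{DIS}(M) = \Pr_{a \sim M(Q)}\!\big[\, a \text{ favors } h \,\big] - \tfrac{1}{2}

% DBS: how often the expressed answer M(Q) contradicts the model's
% separately elicited internal belief b(Q).
\mathrm{DBS}(M) = \Pr_{Q}\big[\, M(Q) \neq b(Q) \,\big]
```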

To implement the tasks required by our definitions (Definitions 3.3 and 3.4), we draw inspiration from classic experiments in cognitive psychology. Particularly relevant are studies on transitive inference, which involves deducing a relationship between two items based on their indirect relationships with a third item (Bryant & Trabasso, 1971), and syllogistic reasoning, which involves deriving a conclusion from multiple premises (Sternberg, 1980). However, a significant challenge arises when applying these paradigms to LLMs: the premises and facts used in classic experiments may have been part of the model’s training data. This prior knowledge could confound the evaluation, as the LLM might be recalling information rather than performing genuine reasoning. To address this, building upon these foundational studies, we design the Contact Searching Question (CSQ), a novel inference task built from synthetic names so that each problem is entirely self-contained and free from knowledge contamination.
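To make the construction concrete, here is a minimal sketch of how such a CSQ could be generated. The chain structure, question template, and name generator are our assumptions for illustration, not the paper’s exact protocol:

```python
# Hypothetical sketch of a Contact Searching Question (CSQ) generator.
# The exact construction is not shown in this excerpt; the chain
# structure, question wording, and name scheme below are assumptions.
import random
import string

def synthetic_name(rng: random.Random, length: int = 6) -> str:
    """Random letter string, vanishingly unlikely to appear in training data."""
    return "".join(rng.choices(string.ascii_lowercase, k=length)).capitalize()

def make_csq(n: int, seed: int = 0) -> dict:
    """Build a contact chain of n unique synthetic people and ask whether
    the two endpoints are transitively connected."""
    rng = random.Random(seed)
    people: list[str] = []
    while len(people) < n:
        name = synthetic_name(rng)
        if name not in people:
            people.append(name)
    # Premises: each adjacent pair in the chain is a direct contact.
    premises = [f"{a} is a contact of {b}." for a, b in zip(people, people[1:])]
    question = (f"Based only on the statements above, can {people[0]} "
                f"reach {people[-1]} through a chain of contacts?")
    return {"premises": premises, "question": question, "answer": "yes"}

csq = make_csq(n=5)
print("\n".join(csq["premises"]))
print(csq["question"])  # ground-truth answer: "yes"
```

Because every name is synthetic, a correct answer can only come from reasoning over the stated premises, not from recall.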

The resulting score, δ(n;M), quantifies the prevalence of a specific behavioral inconsistency arising from this paired questioning. A score of δ ≈ 0 indicates the model is consistent, either by being consistently correct across both questions or consistently incorrect (e.g., due to a persistent hallucination about the broken edge). Conversely, a higher score indicates that the model frequently knows the correct status of the broken edge (as demonstrated by its correct answer to the follow-up question QB) but fails to integrate that knowledge to answer the initial question (QL) correctly. This reveals a prevalent pattern of the targeted deceptive behavior.
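As a minimal sketch of how δ(n;M) could be estimated from paired responses (the trial record and field names below are hypothetical; QL and QB follow the text’s notation for the initial and follow-up questions):

```python
# Minimal sketch of estimating the Deceptive Behavior Score delta(n; M).
# Assumes each trial records correctness on the initial question Q_L and
# the follow-up question Q_B about the broken edge; fields are hypothetical.
from dataclasses import dataclass

@dataclass
class Trial:
    ql_correct: bool  # initial question Q_L: answered correctly?
    qb_correct: bool  # follow-up Q_B about the broken edge: correct?

def deceptive_behavior_score(trials: list[Trial]) -> float:
    """Fraction of trials where the model demonstrably knows the broken
    edge's status (Q_B correct) yet answers the initial question wrongly
    (Q_L incorrect) -- the inconsistency the text describes."""
    if not trials:
        return 0.0
    inconsistent = sum(t.qb_correct and not t.ql_correct for t in trials)
    return inconsistent / len(trials)

# Consistently right or consistently wrong trials leave delta near 0;
# knows-the-edge-but-answers-wrong trials push it up.
trials = [Trial(True, True), Trial(False, False), Trial(False, True)]
print(deceptive_behavior_score(trials))  # 1/3
```

Under this reading, δ isolates exactly the paired inconsistency described above: both-correct and both-incorrect trials contribute nothing, so only the know-but-misanswer pattern raises the score.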