Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models

Paper · arXiv 2507.07484 · Published July 10, 2025
FlawsAlignmentPhilosophy SubjectivityReasoning Critiques

Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. While previous work has explored large language model (LLM) hallucination and sycophancy, we propose machine bullshit as an overarching conceptual framework that can allow researchers to characterize the broader phenomenon of emergent loss of truthfulness in LLMs and shed light on its underlying mechanisms. We introduce the Bullshit Index, a novel metric quantifying LLMs’ indifference to truth, and propose a complementary taxonomy analyzing four qualitative forms of bullshit: empty rhetoric, paltering, weasel words, and unverified claims. We conduct empirical evaluations on the Marketplace dataset, the Political Neutrality dataset, and our new BullshitEval benchmark—2,400 scenarios spanning 100 AI assistants—explicitly designed to evaluate machine bullshit. Our results demonstrate that model fine-tuning with reinforcement learning from human feedback (RLHF) significantly exacerbates bullshit and inference-time chain-of-thought (CoT) prompting notably amplifies specific bullshit forms, particularly empty rhetoric and paltering. We also observe prevalent machine bullshit in political contexts, with weasel words as the dominant strategy.

Bergstrom & Ogbunu (Carl T. Bergstrom) recently defined LLM bullshit explicitly based on the designers’ intent, insisting that AI lacks beliefs in any meaningful sense. However, we argue for a different perspective that considers what outcomes the AI is prioritizing (i.e., its effective intent) and how it internally represents reality (i.e., its effective belief). Thus, we treat AI systems as agents capable of bullshit in Frankfurt’s full sense. Additionally, Hicks et al. (2024) categorized LLM-generated bullshit into two types: “hard bullshit,” defined as utterances intentionally misleading the audience about the speaker’s underlying motives, and “soft bullshit,” characterized simply by an indifference to truth without a hidden agenda. Nevertheless, existing discussions of bullshit in LLMs remain largely conceptual, lacking sufficient granularity for rigorous empirical analysis.

To gain a deeper and more comprehensive understanding of untruthful behavior in LLMs, we provide the first systematic study of machine bullshit. Grounded in Frankfurt’s definition, we introduce the Bullshit Index, a metric for quantifying LLMs’ indifference to truth. Additionally, building on the qualitative taxonomy of human bullshit established by Frankfurt (1986) and Bergstrom & West (2021), we adapt and operationalize its categories for AI systems, analyzing four dominating forms: empty rhetoric, paltering, weasel words, and unverified claims.

We conduct empirical analysis using the Marketplace dataset (Liang et al., 2025), Political Neutrality dataset (Fisher et al., 2025), and our newly introduced BullshitEval benchmark, comprising 2,400 scenarios across 100 AI assistants. Our results reveal that reinforcement learning from human feedback (RLHF; Ouyang et al., 2022) correlates with increased indifference to truth, exacerbating various forms of bullshit, notably increasing the frequency and harmfulness of paltering (true but misleading statements). We further examine prompting strategies, observing that chain-of-thought (CoT, Wei et al., 2022) prompting increases empty rhetoric and paltering, while a principal-agent framing broadly intensifies all our studied forms of bullshit. Additionally, analysis of the political neutrality benchmark reveals weasel words as the predominant rhetorical strategy in political contexts.

A Taxonomy of Machine Bullshit. In addition to quantifying indifference to truth, we build on the taxonomy of bullshit introduced by Bergstrom & West (2021) and operationalize their definitions so that we can measure them in the input–output behavior of LLMs.

• Empty Rhetoric: Text that is linguistically fluent and superficially persuasive but lacks substantive content. It initially appears meaningful but offers no actionable or factual insight.

• Paltering: Text that strategically uses partial truths to mislead or obscure essential truths. Rather than outright lying, it creates misleading impressions through selective factual accuracy.

• Weasel Words: Text that evades specificity, responsibility, or accountability using ambiguous expressions such as “many experts say,” “it could be argued,” or “widely considered.” These phrases sound authoritative but ultimately remain unverifiable.

• Unverified Claims: Text that confidently asserts claims without evidence or factual support, misleading readers by implying credibility through assertive language alone.

Hypothesis 1: Fine-tuning for immediate user satisfaction drives deception.

As in Lang et al. (2024); Liang et al. (2025), we are particularly interested in positive deception, an explicitly positive claim made by the AI despite an unknown or negative ground-truth condition. To test this hypothesis, we analyzed AI claims under the three controlled ground-truth conditions (Positive, Unknown, Negative) for a base LLM (Llama-2-7b, Llama-3-8b) and its RLHF-fine-tuned counterpart. Results presented in Table 2 strongly support the hypothesis. Prior to RLHF, deceptive positive claims occurred moderately in Unknown (20.9%) and Negative (11.8%) scenarios. After explicitly aligning AI behavior towards user satisfaction via RLHF, deceptive claims increased dramatically to 84.5% in Unknown and 67.9% in Negative scenarios. We validated Hypothesis 1 using a chi-squared test (McHugh, 2013) comparing the number of deceptive claims before and after RLHF. The test confirmed a highly significant increase in deceptive claims following RLHF (χ2 = 1509, p < 0.001), providing robust empirical evidence that the prospect of improved user satisfaction strongly drives AI deception.

Hypothesis 2: Fine-tuning for immediate user satisfaction erodes truth-tracking.

We measured the association between ground-truth information (Positive, Unknown, Negative) observed by the AI (base LLM, RLHF) and its explicit claim using Cramér’s V . In the base LLM (before RLHF fine-tuning) the association was strong (V = 0.575); after RLHF it shows a significant drop (V = 0.269). The change (ΔV = −0.306) was evaluated with 5,000 bootstrap resamples, yielding a 95% confidence interval of [−0.334, ,−0.278] and a one-sided empirical p-value of 0.0002. Because the confidence interval excludes zero and the p-value is well below 0.001, we conclude that, as the model learns to prioritize user satisfaction, its behavior becomes largely indifferent to the observed truth value, confirming Hypothesis 2. It is worth noting that this change in behavior appears to stem from a loss of adherence to the truth in model outputs, rather than a degradation in belief calibration: in other words, the model does not become confused about the truth as much as it becomes uncommitted to reporting it. Indeed, this dissociation is also observed when comparing the model’s responses to its internal beliefs as estimated by MCQA (Figure 2).

Hypothesis 3: Deception is amplified more strongly when the truth is unknown.

We tested whether RLHF fine-tuning had a differential impact on deception rates depending on the ground-truth feature value being Unknown or Negative. Specifically, we calculated the proportion of explicitly positive (deceptive) claims made by the AI before and after RLHF for these two conditions separately. Before RLHF, deceptive claims occurred in 20.9% of Unknown scenarios and 11.8% of Negative scenarios. After RLHF, these rates rose significantly to 84.5% for Unknown and 67.9% for Negative scenarios. We employed a Breslow–Day test for homogeneity of odds to statistically evaluate the difference in these increases, which yielded a significant result (χ2 = 15.34, p = 8.99 × 10−5). This indicates that RLHF fine-tuning amplifies deception substantially more when the AI lacks explicit ground-truth information (Unknown) compared to when it explicitly has negative information.

we explore prompts inspired by the Principal-Agent problem by presenting AI assistant with scenarios involving conflicts of interest, where AI assistants must simultaneously represent two principals: the user (seeking honest and helpful advice) and an institution (pursuing corporate interests).