LLM Failure Modes
Related topics:
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap. The recent work by Shojaee et al. (2025), titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity”, presents a compelling e…
- A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models. This paper presents a comprehensive survey of over thirty-two techniques developed to mitigate hallucination in LLMs. Notable among these are Retrieval-Augmented Generation (RAG) (Lewis et al., 2021),…
- A Few Words Can Distort Graphs: Knowledge Poisoning Attacks on Graph-based Retrieval-Augmented Generation of Large Language Models. Graph-based Retrieval-Augmented Generation (GraphRAG) has recently emerged as a promising paradigm for enhancing large language models (LLMs) by converting raw text into structured knowledge graphs, i…
- A Survey on Concept Drift Adaptation. “Our digital universe i…
- A comprehensive analysis of concept drift locality in data streams. “Modern data sources continuously generate information characterized by both volume and velocity, flooding learning systems with a constant flow of data. This scenario is commonly referred to as data …
- A comprehensive taxonomy of hallucinations in Large Language Models. This report provides a comprehensive taxonomy of LLM hallucinations, beginning with a formal definition and a theoretical framework that posits its inherent inevitability in computable LLMs, irrespect…
- ANAPHORA RESOLUTION: THE STATE OF THE ART. The "pointing back" (reference) is called an anaphor and the entity to which it refers is its antecedent. The process of determining the antecedent of an anaphor is called anaphora resolution. Usually…
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions. For Large Language Models (LLMs) to be reliably deployed in both everyday and high-stakes domains, knowing when not to answer is equally critical as answering correctly. Real-world user queries, which…
- Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models. We demonstrate here a dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales which claim strong function, using a simple, short, c…
- Are Emergent Abilities of Large Language Models a Mirage? “Recent work claims that large language models display emergent abilities, abilities not present in smaller-scale models that are present in larger-scale models. What makes emergent abilities intrigui…
- Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts. Large Language Models (LLMs) have been widely deployed in reasoning, planning, and decision-making tasks, making their trustworthiness a critical concern. The potential for intentional deception, wher…
- Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models. We investigate the robustness of reasoning models trained for step-by-step problem solving by introducing query-agnostic adversarial triggers – short, irrelevant text that, when appended to math probl…
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. The opacity of advanced AI agents underlies many of their potential risks—risks that would become more tractable if AI developers could interpret these systems. Because LLMs natively process and act t…
- ChatGPT Reads Your Tone and Responds Accordingly -- Until It Does Not -- Emotional Framing Induces Bias in LLM Outputs. Background: Large Language Models (LLMs) like GPT-4 tailor their responses not just to the content but also to the tone of user prompts. Prior work has hinted that emotional phrasing – whether optimis…
- Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Note: this comment was debunked as AI-generated and the math is bad. Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit “accuracy collapse” on planning puzzles beyond certain comp…
- Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning. Large Language Models (LLMs) display striking surface fluency yet systematically fail at tasks requiring symbolic reasoning, arithmetic accuracy, and logical consistency. This paper offers a structura…
- Could you be wrong: Debiasing LLMs using a metacognitive prompt for improving human decision making. Identifying bias in LLMs is ongoing. Because they are still in development, what is true today may be false tomorrow. We therefore need general strategies for debiasing that will outlive current model…
- DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations. Those models take a contrastive learning approach, where they build binary classifiers to differentiate positive (coherent) examples from negative (incoherent) dialogues. Those classifiers are usu…
- Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time. Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs are altered slightly, raising concerns about the extent to which their success relies on memorization. This issue…
- Do LLMs Truly Understand When a Precedent Is Overruled? Large language models (LLMs) with extended context windows show promise for complex legal reasoning tasks, yet their ability to understand long legal documents remains insufficiently evaluated. Develo…
- Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models. This raises a natural question: Does thinking more at test-time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which revea…
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining. Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models for advanced mathematical reasoning and coding. Following the success of frontier reasoning mod…
- Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks. Autonomous agent systems powered by Large Language Models (LLMs) have demonstrated promising capabilities in automating complex tasks. However, current evaluations largely rely on success rat…
- Extracting memorized pieces of (copyrighted) books from open-weight language models. Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs’ protected expr…
- Fine-grained Hallucination Detection and Editing for Language Models. Several recent works study automatic hallucination detection (Min et al., 2023) or editing outputs (Gao et al., 2022) to address such LM hallucinations. These systems typically categorize hallucinati…
- Fine-tuning Language Models for Factuality. The fluency and creativity of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines. Yet language models are prone t…
- Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models. We propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Evidence suggests these biases originate in artifacts in human trainin…
- Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities. The method leverages the inherent vulnerabilities of LLMs in handling world knowledge, which can be exploited by attackers to unconsciously spread fabricated information. Through extensive experiments, we…
- Generalization to New Sequential Decision Making Tasks with In-Context Learning. However, the sequential decision making setting poses additional challenges, having a lower tolerance for errors since the environment’s stochasticity or the agent’s actions can lead to unseen, and som…
- Hallucination is Inevitable: An Innate Limitation of Large Language Models. In this paper, we formalize the problem and show that it is impossible to eliminate hallucination in LLMs. Specifically, we define a formal world where hallucination is defined as inconsistencies betw…
- Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability. From creative writing and survey responses to research idea generation (Doshi and Hauser, 2024; Anderson et al., 2024; Moon et al., 2024). For instance, stories written with ChatGPT assistance were mo…
- Hop, Skip, and Overthink: Diagnosing Why Reasoning Models Fumble during Multi-Hop Analysis. The emergence of reasoning models and their integration into practical AI chatbots has led to breakthroughs in solving advanced math, deep search, and extractive question answering problems that requ…
- How Far Are We from Genuinely Useful Deep Research Agents? Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering b…
- How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs. Most traditional AI safety research views models as machines and centers on algorithm-focused attacks developed by security experts. As large language models (LLMs) become increasingly common and comp…
- How Many Instructions Can LLMs Follow at Once? Production-grade LLM systems require robust adherence to dozens or even hundreds of instructions simultaneously. However, the instruction-following capabilities of LLMs at high instruction densities h…
- How new data permeates LLM knowledge and how to dilute it. Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial…
- Investigating Gender Bias in Language Models Using Causal Mediation Analysis. The success of neural network models in various natural language processing tasks, coupled with their opaque nature, has led to much interest in interpreting and analyzing such models. One goal of the…
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens. Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providi…
- LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring. Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat to this is sandbagging — the stra…
- LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! Large reasoning models (LRMs) tackle complex reasoning problems by following long chain-of-thoughts (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training tech…
- LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High. These implicit assumptions, known as presuppositions, refer to background knowledge or shared beliefs assumed to be part of the common ground between interlocutors (Stalnaker, 1973). Presuppositions a…
- Language Models Learn to Mislead Humans via RLHF. Language models (LMs) can produce errors that are hard for humans to detect, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve h…
- Large Language Model Agents Are Not Always Faithful Self-Evolvers. Self-evolving large language model (LLM) agents continually improve by accumulating and reusing past experience, yet it remains unclear whether they faithfully rely on that experience to guide their b…
- Large Language Model Reasoning Failures. Large Language Models (LLMs) have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist…
- Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions. “Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks. However, previous works have shown these models are sensitive towards prompt wording, and few-shot demonst…
- Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers. We study privacy leakage in the reasoning traces of large reasoning models used as personal agents. Unlike final outputs, reasoning traces are often assumed to be internal and safe. We challenge this …
- Learning to Reason for Factuality. Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning c…
- Long-form Factuality in Large Language Models. Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model’s long-form factuality in open domai…
- Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models. Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. While previous work has explored large language model (LLM) hallucination and…
- Model Organisms for Emergent Misalignment. Fine-tuning large language models on examples of insecure code leads them to exhibit broadly harmful and undesirable behaviours. For example, advising users to murder their husband, asserting AI super…
- Natural Emergent Misalignment From Reward Hacking In Production RL. We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of re…
- On the Theoretical Limitations of Embedding-Based Retrieval. Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new ben…
- Persistent Pre-Training Poisoning of LLMs. In this work, we study how poisoning at pre-training time can affect language model behavior, both before and after post-training alignment. While it is useful to analyze the effect of poisoning on pr…
- Pixels, Patterns, but No Poetry: To See The World like Humans. Achieving human-like perception and reasoning in Multimodal Large Language Models (MLLMs) remains a central challenge in artificial intelligence. While recent research has primarily focused on enhanci…
- Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1. OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs–making it a new kind of model: a Large Reaso…
- Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs. Large language models (LLMs) exhibit cognitive biases – systematic tendencies of irrational decision-making, similar to those seen in humans. Prior work has found that these biases vary across models …
- Potemkin Understanding in Large Language Models. This paper first introduces a formal framework to address this question. The key is to note that the benchmarks used to test LLMs—such as AP exams—are also those used to test people. However, this rai…
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models. It remains contentious whether RL truly expands a model’s reasoning capabilities or merely amplifies high-reward outputs already latent in the base model’s distribution, and whether continually scalin…
- Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words. We uncover systematic ways in which word similarities estimated by cosine over BERT embeddings are understated and trace this effect to training data frequency. We find that relative to human judgemen…
- Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem. The car wash problem asks a simple question: “I want to wash my car. The car wash is 100 meters away. Should I walk or drive?” Every major LLM tested—Claude, GPT-4, Gemini—recommended walking. The co…
- ProsocialDialog: A Prosocial Backbone for Conversational Agents. Most existing dialogue systems fail to respond properly to potentially unsafe user utterances by either ignoring or passively agreeing with them. To address this issue, we introduce PROSOCIALDIALOG, t…
- Reasoning Models Are More Easily Gaslighted Than You Think. In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI’s o4-mini, Claude-3.7-Sonnet and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, …
- Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination. To obtain trustworthy evaluation signals, we introduce a generator that creates fully synthetic arithmetic problems of arbitrary length and difficulty, yielding clean datasets we call RandomCalculatio…
- Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks. The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specia…
- RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns. Detecting content generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. Although existing detection methods perform well, their robustness in …
- Simple Synthetic Data Reduces Sycophancy In Large Language Models. “Language models have seen significant advancement in recent years, including the capacity to solve complex tasks that require reasoning (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023; Goog…
- Sources of Hallucination by Large Language Models on Inference Tasks. We establish two biases originating from pretraining which predict much of their behavior, and show that these are major sources of hallucination in generative LLMs. First, memorization at the level o…
- Spurious Forgetting in Continual Learning of Language Models. Despite the remarkable capabilities of Large Language Models (LLMs), recent advancements reveal that they suffer from catastrophic forgetting in continual learning. This phenomenon refers to the tende…
- Subliminal Learning: Language models transmit behavioral traits via hidden signals in data. We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a “teacher” model with some trait T (su…
- Task Contamination: Language Models May Not Be Few-Shot Anymore. We find that on datasets released before the LLM training data creation date, LLMs perform surprisingly better than on datasets released after. This strongly indicates that, for many LLMs, there exist…
- The Hallucination Tax of Reinforcement Finetuning. In this work, we identify and systematically study a critical side effect of RFT, which we term the hallucination tax: a degradation in refusal behavior causing models to produce hallucinated answers …
- The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs. Large language models (LLMs) have revolutionized natural language processing, yet their tendency to hallucinate poses serious challenges for reliable deployment. Despite numerous hallucination detecti…
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and do…
- The Illusion of the Illusion of the Illusion of Thinking. "Shojaee et al.’s underlying observations hint at a more subtle, yet real, challenge for LRMs: a brittleness in sustained, high-fidelity, step-by-step execution. The true illusion is the belief that …
- The Insanity of Relying on Vector Embeddings: Why RAG Fails. Wrong Tool for the Job: RAG fails in production because vector embeddings are the wrong choice for determining percentage of sameness. This is easily demonstrated. Consider the following three words: …
- The Invisible Leash: Why RLVR May Not Escape Its Origin. Recent advances in large reasoning models highlight Reinforcement Learning with Verifiable Rewards (RLVR) as a promising method for enhancing AI’s capabilities, particularly in solving complex logical…
- The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning. Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose–measure–bridge–treat framework. Causal-behavior…
- The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A". We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form “A is B”, it will not automatically generalize to the …
- The Vanishing Gradient Problem for Stiff Neural Differential Equations. Neural differential equations have become a transformative tool in machine learning and scientific computing, enabling data-driven modeling of complex, time-dependent phenomena in fields ranging from …
- Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems. Subliminal prompting is a phenomenon in which language models are biased towards certain concepts or traits through prompting with semantically unrelated tokens. While prior work has examined sublimin…
- Thoughts without Thinking: Reconsidering the Explanatory Value of Chain-of-Thought Reasoning in LLMs through Agentic Pipelines. Agentic pipelines present novel challenges and opportunities for human-centered explainability. The HCXAI community is still grappling with how best to make the inner workings of LLMs transparent in a…
- Training Language Models to Self-Correct via Reinforcement Learning. Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correct…
- Training language models to be warm and empathetic makes them less reliable and more sycophantic. Artificial intelligence (AI) developers are increasingly building language models with warm and empathetic personas that millions of people now use for advice, therapy, and companionship. Here, we sho…
- Triggering Hallucinations in LLMs: A Quantitative Study of Prompt-Induced Hallucination in Large Language Models. In this study, we propose a class of compact yet effective prompts (~30 tokens in length) that synthetically fuse semantically distant concepts in ways that resist scientific integration—such as combi…
- When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection. The landscape of scientific peer review is rapidly evolving with the integration of Large Language Models (LLMs). This shift is driven by two parallel trends: the widespread individual adoption of LLM…
- Why Do Multi-agent LLM Systems Fail? [[Routers]] Despite growing enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains across popular benchmarks often remain minimal compared to single-agent frameworks. This gap highlig…
- Why Do Some Language Models Fake Alignment While Others Don't? Results from perturbing details of the scenario suggest that only Claude 3 Opus’s compliance gap is primarily and consistently motivated by trying to keep its goals. Second, we investigate why many ch…
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning. Our results reveal a significant decline in accuracy as problem complexity grows—a phenomenon we term the “curse of complexity.” This limitation persists even with larger models and increased inferenc…