Mechanistic Interpretability
Related topics:
- A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis. In this paper, we present a set of analyses aimed at mechanistically interpreting LMs on the task of answering simple arithmetic questions (e.g., “What is the product of 11 and 17?”). In particular, w…
- A polar coordinate system represents syntax in large language models. Originally formalized with symbolic representations, syntactic trees may also be effectively represented in the activations of large language models (LLMs). Indeed, a “Structural Probe” can find a sub…
- Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics. Do large language models (LLMs) solve reasoning tasks by learning robust generalizable algorithms, or do they memorize training data? To investigate this question, we use arithmetic reasoning as a rep…
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders. Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed…
- Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases. Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawi…
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning. Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well …
- Break It Down: Evidence for Structural Compositionality in Neural Networks. Though modern neural networks have achieved impressive performance in both vision and language tasks, we know little about the functions that they implement. One possibility is that neural networks im…
- Circuit Tracing: Revealing Computational Graphs in Language Models. Understanding and Labeling Features: We use feature visualizations similar to those shown in our previous work, Scaling Monosemanticity, in order to manually interpret and label individual features in…
- Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence. The discovery that “next-token predictor” language models can fluently produce text has important but underappreciated theoretical implications. Most notably, their success demonstrates that fully rel…
- Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data. One way to address safety risks from large la…
- Consistency Training Helps Stop Sycophancy and Jailbreaks. An LLM’s factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within speci…
- Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning. Large reasoning models (LRMs) have demonstrated impressive capabilities in complex problem-solving, yet their internal reasoning mechanisms remain poorly understood. In this paper, we investigate the …
- Detecting hallucinations in large language models using semantic entropy. Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations—confabulations—which are arbitrary and incorrect genera… (a minimal sketch of the semantic-entropy computation appears after this list)
- Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time. Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs alter slightly, raising concerns about the extent to which their success relies on memorization. This issue…
- Do LLMs Encode Functional Importance of Reasoning Tokens? Large language models solve complex tasks by generating long reasoning chains, achieving higher accuracy at the cost of increased computational cost and reduced ability to isolate functionally relevan…
- Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? We evaluate how well Large Language Models (LLMs) latently recall and compose facts to answer multi-hop queries like “In the year Scarlett Johansson was born, the Summer Olympics were hosted in the co…
- Eliciting Latent Knowledge from Quirky Language Models. Eliciting Latent Knowledge (ELK) aims to find patterns in a neural network’s activations that robustly track the true state of the world, even in cases where the model’s output is untrusted and hard t…
- Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning. Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet underlying mechanisms driving this success remain largely opaq…
- Emergent Introspective Awareness in Large Language Models. Injected “thoughts”: In our first experiment, we explained to the model the possibility that “thoughts” may be artificially injected into its activations, and observed its responses on control trials …
- Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models. The growing deployment of large language models (LLMs) across diverse cultural contexts necessitates a better understanding of how the overgeneralization of less documented cultures within LLMs’ repre…
- Everything Everywhere All At Once: LLMs Can In-Context Learn Multiple Tasks in Superposition. Large Language Models (LLMs) have demonstrated remarkable in-context learning (ICL) capabilities. In this study, we explore a surprising phenomenon related to ICL: LLMs can perform multiple, computati…
- Generative Models as a Complex Systems Science: How can we make sense of large language model behavior? Coaxing out desired behavior from pretrained models, while avoiding undesirable ones, has redefined NLP and is reshaping how we interact with computers. What was once a scientific engineering discipli…
- Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization. We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types…
- How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding. Multimodal Large Language Models (MLLMs) have demonstrated strong performance across a wide range of vision-language tasks, yet their internal processing dynamics remain underexplored. In this work, w…
- How do Transformers Learn Implicit Reasoning? Recent work suggests that large language models (LLMs) can perform multi-hop reasoning implicitly—producing correct answers without explicitly verbalizing intermediate steps—but the underlying mechani…
- How much do language models memorize? We propose a new method for estimating how much a model “knows” about a datapoint and use it to measure the capacity of modern language models. We formally separate memorization into two components: u…
- How new data permeates LLM knowledge and how to dilute it. Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial…
- Improving large language models with concept-aware fine-tuning. Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts,…
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. We introduce Inference-Time Intervention (ITI), a technique designed to enhance the “truthfulness” of large language models (LLMs). ITI operates by shifting model activations during inference, followi… (a minimal activation-steering sketch appears after this list)
- Inspecting and Editing Knowledge Representations in Language Models. [[Natural Language Inference]] Neural language models (LMs) represent facts about the world described by text. Sometimes these facts derive from training data (in most LMs, a representation of the …
- Investigating task-specific prompts and sparse autoencoders for activation monitoring. Language models can behave in unexpected and unsafe ways, and so it is valuable to monitor their outputs. Internal activations of language models encode additional information that could be useful for…
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens. Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providi…
- LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! Large reasoning models (LRMs) tackle complex reasoning problems by following long chain-of-thoughts (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training tech…
- Large Language Models Report Subjective Experience Under Self-Referential Processing. Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theor…
- Latent Collaboration in Multi-Agent Systems. Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediatio…
- LatentQA: Teaching LLMs to Decode Activations Into Natural Language. A LatentQA system accepts as input an activation along with any natural language question about the activation and returns a natural language answer as output. For example, the system might accept LLM…
- Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs. Existing methods for adapting large language models (LLMs) to new tasks are not suited to multi-task adaptation because they modify all the model weights, causing destructive interference between tasks…
- Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models. Large Reasoning Models (LRMs) have significantly enhanced their capabilities in complex problem-solving by introducing a thinking draft that enables multipath Chain-of-Thought explorations before prod…
- Mechanisms of Introspective Awareness. Recent work has shown that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept—a phenomenon termed “introspective awareness.” We i…
- Mechanistic Indicators of Understanding in Large Language Models. Large language models (LLMs) are often portrayed as merely imitating linguistic patterns without genuine understanding. We argue that recent findings in mechanistic interpretability (MI), th…
- Natural Emergent Misalignment from Reward Hacking in Production RL. We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of re…
- Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance. Supervised fine-tuning (SFT) is a pivotal approach to adapting large language models (LLMs) for downstream tasks; however, performance often suffers from the “seesaw phenomenon”, where indiscriminate …
- Open Problems in Mechanistic Interpretability. Recent progress in artificial intelligence (AI) has resulted in rapidly improved AI capabilities. These capabilities are not designed by humans. Instead, they are learned by deep neural networks (Hint…
- Persistent Pre-Training Poisoning of LLMs. In this work, we study how poisoning at pre-training time can affect language model behavior, both before and after post-training alignment. While it is useful to analyze the effect of poisoning on pr…
- Progress Measures for Grokking via Mechanistic Interpretability. Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding em…
- Pushdown Layers: Encoding Recursive Structure in Transformer Language Models. Recursion is a prominent feature of human language, and fundamentally challenging for self-attention due to the lack of an explicit recursive-state tracking mechanism. Consequently, Transformer langua…
- Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis. Much of the excitement in modern AI is driven by the observation that scaling up existing systems leads to better performance. But does better performance necessarily imply better internal representat…
- Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference. Recent studies on reasoning in language models (LMs) have sparked a debate on whether they can learn systematic inferential principles or merely exploit superficial patterns in the training data. To u…
- Representation Engineering: A Top-Down Approach to AI Transparency. …how these models work on the inside and are mostly limited to treating them as black boxes. Enhanced transparency of these models would offer numerous benefits, from a deeper understanding of their de…
- Representation biases: will we achieve complete understanding by analyzing representations? A common approach in neuroscience is to study neural representations as a means to understand a system—increasingly, by relating the neural representations to the internal representations learned by c…
- Retrieval Head Mechanistically Explains Long-Context Factuality. Despite the recent progress in long-context large language models (LLMs), it remains elusive how these transformer-based language models acquire the capability to retrieve relevant information from ar…
- Scaling can lead to compositional generalization. Can neural networks systematically capture discrete, compositional task structure despite their continuous, distributed nature? The impressive capabilities of large scale neural networks suggest that …
- Schema-learning and rebinding as mechanisms of in-context learning and emergence. In-context learning (ICL) is one of the most powerful and most unexpected capabilities to emerge in recent transformer-based large language models (LLMs). Yet the mechanisms that underlie it are poor…
- Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution? To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue th…
- Semantic Structure in Large Language Model Embeddings. Psychological research consistently finds that human ratings of words across diverse semantic scales can be reduced to a low-dimensional form with relatively little information loss. We find that the …
- Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space. Human cognition typically involves thinking through abstract, fluid concepts rather than strictly using discrete linguistic tokens. Current reasoning models, however, are constrained to reasoning with…
- Subliminal Learning: Language models transmit behavioral traits via hidden signals in data. We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a “teacher” model with some trait T (su…
- The Vanishing Gradient Problem for Stiff Neural Differential Equations. Neural differential equations have become a transformative tool in machine learning and scientific computing, enabling data-driven modeling of complex, time-dependent phenomena in fields ranging from …
- Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens. Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts ar…
- Thought Anchors: Which LLM Reasoning Steps Matter? We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We present three complementary attribution methods: (1) a black-box method …
- Thought Communication in Multiagent Collaboration. Natural language has long enabled human cooperation, but its lossy, ambiguous, and indirect nature limits the potential of collective intelligence. While machines are not subject to these constraints,…
- Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties. Recent large-scale reasoning models have achieved state-of-the-art performance on challenging mathematical benchmarks, yet the internal mechanisms underlying their success remain poorly understood. In…
- Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are diff…
- Toward understanding and preventing misalignment generalization. Large language models like ChatGPT don’t just learn facts—they pick up on patterns of behavior. That means they can start to act like different “personas,” or types of people, based on the content the…
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and h… (a minimal sparse-autoencoder sketch appears after this list)
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control. We observe two qualitative phenomena in SAE training: feature occlusion (where a causally relevant concept is robustly overshadowed by even slightly higher-magnitude ones in the learned features), and…
- Towards Safe and Honest AI Agents with Neural Self-Other Overlap. As AI systems increasingly make critical decisions, deceptive AI poses a significant challenge to trust and safety. We present Self-Other Overlap (SOO) fine-tuning, a promising approach in AI Safety t…
- Weight-sparse transformers have interpretable circuits. Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of th…
- What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models. Foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much like how Kepler’s predictions of planetary motion later led to the discovery of Newton…
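Illustrative sketches:

A minimal sketch of the idea behind the "Detecting hallucinations in large language models using semantic entropy" entry: sample several answers to the same question, group them into meaning-equivalent clusters, and compute the entropy of the cluster distribution; high entropy over meanings (not just surface strings) flags a likely confabulation. The paper clusters answers by bidirectional entailment with an NLI model; the `normalise` function below is my own stand-in for that clustering step, used only to keep the example self-contained.

```python
import math
from collections import Counter

def normalise(answer: str) -> str:
    """Stand-in for semantic clustering: lowercase and strip punctuation.
    The paper instead groups answers via bidirectional entailment with an NLI model."""
    return "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace()).strip()

def semantic_entropy(sampled_answers: list[str]) -> float:
    """Entropy (in nats) over clusters of semantically equivalent sampled answers."""
    clusters = Counter(normalise(a) for a in sampled_answers)
    total = sum(clusters.values())
    return -sum((n / total) * math.log(n / total) for n in clusters.values())

# Paraphrases of one fact collapse to a single cluster (low entropy),
# while mutually inconsistent answers spread over clusters (high entropy).
print(semantic_entropy(["Paris.", "paris", "Paris"]))        # 0.0
print(semantic_entropy(["Paris.", "Lyon.", "Marseille."]))   # ~1.10
```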
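A minimal sketch of the activation-shifting mechanics behind the "Inference-Time Intervention" entry: during generation, add a steering vector to the hidden states of a chosen layer. In the paper the directions come from linear probes on selected attention heads of a chat model; here a random unit vector added to one GPT-2 block is a placeholder, and the model name, layer index, and strength are illustrative assumptions rather than the paper's settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                              # illustrative small model, not the paper's
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx, alpha = 8, 4.0                        # hypothetical layer and intervention strength
direction = torch.randn(model.config.n_embd)     # placeholder for a probe-derived direction
direction = direction / direction.norm()

def shift_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # returning a modified tuple replaces the block's output.
    shifted = output[0] + alpha * direction.to(output[0].dtype)
    return (shifted,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(shift_hook)
ids = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=10, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```

Sweeping the strength trades off how hard generation is pushed along the direction against fluency; ITI tunes an analogous strength hyperparameter for its probe-derived truthful directions.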
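A minimal sketch of the dictionary-learning setup named in the "Towards Monosemanticity" entry: a one-hidden-layer sparse autoencoder trained to reconstruct model activations under an L1 sparsity penalty, so that individual dictionary features tend to fire for a single concept. The dimensions, penalty weight, and random stand-in activations below are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(feats), feats     # reconstruction and features

sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                # hypothetical sparsity weight

activations = torch.randn(1024, 512)           # stand-in for MLP activations from an LM
for step in range(100):
    recon, feats = sae(activations)
    loss = ((recon - activations) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training on real activations, each dictionary feature can be inspected through the dataset examples that activate it most strongly, which is the interpretation step the entry alludes to.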