Mechanistic Interpretability
Related topics:
- A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis. In this paper, we present a set of analyses aimed at mechanistically interpreting LMs on the task of answering simple arithmetic questions (e.g., “What is the product of 11 and 17?”). In particular, w…
- A polar coordinate system represents syntax in large language models. Originally formalized with symbolic representations, syntactic trees may also be effectively represented in the activations of large language models (LLMs). Indeed, a “Structural Probe” can find a sub…
- Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics. Do large language models (LLMs) solve reasoning tasks by learning robust generalizable algorithms, or do they memorize training data? To investigate this question, we use arithmetic reasoning as a rep…
- AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders. Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed…
- Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases. Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawi…
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning. Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well …
- Break It Down: Evidence for Structural Compositionality in Neural Networks. Though modern neural networks have achieved impressive performance in both vision and language tasks, we know little about the functions that they implement. One possibility is that neural networks im…
- Circuit Tracing: Revealing Computational Graphs in Language Models. Understanding and Labeling Features: We use feature visualizations similar to those shown in our previous work, Scaling Monosemanticity, in order to manually interpret and label individual features in…
- Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence. The discovery that “next-token predictor” language models can fluently produce text has important but underappreciated theoretical implications. Most notably, their success demonstrates that fully rel…
- Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data. One way to address safety risks from large la…
- Consistency Training Helps Stop Sycophancy and Jailbreaks. An LLM’s factuality and refusal training can be compromised by simple changes to a prompt. Models often adopt user beliefs (sycophancy) or satisfy inappropriate requests which are wrapped within speci…
- Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning. Large reasoning models (LRMs) have demonstrated impressive capabilities in complex problem-solving, yet their internal reasoning mechanisms remain poorly understood. In this paper, we investigate the …
- Detecting hallucinations in large language models using semantic entropy. Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations—confabulations—which are arbitrary and incorrect genera… (a minimal sketch of the semantic-entropy computation appears after this list)
- Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time. Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs alter slightly, raising concerns about the extent to which their success relies on memorization. This issue…
- Do LLMs Encode Functional Importance of Reasoning Tokens? Large language models solve complex tasks by generating long reasoning chains, achieving higher accuracy at the cost of increased computational cost and reduced ability to isolate functionally relevan…
- Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts? We evaluate how well Large Language Models (LLMs) latently recall and compose facts to answer multi-hop queries like “In the year Scarlett Johansson was born, the Summer Olympics were hosted in the co…
- Eliciting Latent Knowledge from Quirky Language Models. Eliciting Latent Knowledge (ELK) aims to find patterns in a neural network’s activations that robustly track the true state of the world, even in cases where the model’s output is untrusted and hard t…
- Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning. Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet underlying mechanisms driving this success remain largely opaq…
- Emergent Introspective Awareness in Large Language Models. Injected “thoughts”: In our first experiment, we explained to the model the possibility that “thoughts” may be artificially injected into its activations, and observed its responses on control trials …
- Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models. The growing deployment of large language models (LLMs) across diverse cultural contexts necessitates a better understanding of how the overgeneralization of less documented cultures within LLMs’ repre…
- Everything Everywhere All At Once: LLMs Can In-Context Learn Multiple Tasks in Superposition. Large Language Models (LLMs) have demonstrated remarkable in-context learning (ICL) capabilities. In this study, we explore a surprising phenomenon related to ICL: LLMs can perform multiple, computati…
- Generative Models as a Complex Systems Science: How can we make sense of large language model behavior? Coaxing out desired behavior from pretrained models, while avoiding undesirable ones, has redefined NLP and is reshaping how we interact with computers. What was once a scientific engineering discipli…
- Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization. We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types…
- How Multimodal LLMs Solve Image Tasks: A Lens on Visual Grounding, Task Reasoning, and Answer Decoding. Multimodal Large Language Models (MLLMs) have demonstrated strong performance across a wide range of vision-language tasks, yet their internal processing dynamics remain underexplored. In this work, w…
- How do Transformers Learn Implicit Reasoning? Recent work suggests that large language models (LLMs) can perform multi-hop reasoning implicitly—producing correct answers without explicitly verbalizing intermediate steps—but the underlying mechani…
- How much do language models memorize? We propose a new method for estimating how much a model “knows” about a datapoint and use it to measure the capacity of modern language models. We formally separate memorization into two components: u…
- How new data permeates LLM knowledge and how to dilute it. Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial…
- Improving large language models with concept-aware fine-tuning. Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts,…
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. We introduce Inference-Time Intervention (ITI), a technique designed to enhance the “truthfulness” of large language models (LLMs). ITI operates by shifting model activations during inference, followi… (a minimal activation-steering sketch appears after this list)
- Inspecting and Editing Knowledge Representations in Language Models. [[Natural Language Inference]] Neural language models (LMs) represent facts about the world described by text. Sometimes these facts derive from training data (in most LMs, a representation of the …
- Investigating task-specific prompts and sparse autoencoders for activation monitoring. Language models can behave in unexpected and unsafe ways, and so it is valuable to monitor their outputs. Internal activations of language models encode additional information that could be useful for…
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens. Chain-of-Thought (CoT) prompting has been shown to improve Large Language Model (LLM) performance on various tasks. With this approach, LLMs appear to produce human-like reasoning steps before providi…
- LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! Large reasoning models (LRMs) tackle complex reasoning problems by following long chain-of-thoughts (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training tech…
- Large Language Models Report Subjective Experience Under Self-Referential Processing. Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theor…
- Latent Collaboration in Multi-Agent Systems. Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediatio…
- LatentQA: Teaching LLMs to Decode Activations Into Natural Language. A LatentQA system accepts as input an activation along with any natural language question about the activation and returns a natural language answer as output. For example, the system might accept LLM…
- Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs. Existing methods for adapting large language models (LLMs) to new tasks are not suited to multi-task adaptation because they modify all the model weights, causing destructive interference between tasks…
- Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models. Large Reasoning Models (LRMs) have significantly enhanced their capabilities in complex problem-solving by introducing a thinking draft that enables multipath Chain-of-Thought explorations before prod…
- Mechanisms of Introspective Awareness. Recent work has shown that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept—a phenomenon termed “introspective awareness.” We i…
- Mechanistic Indicators of Understanding in Large Language Models. Large language models (LLMs) are often portrayed as merely imitating linguistic patterns without genuine understanding. We argue that recent findings in mechanistic interpretability (MI), th…
- Natural Emergent Misalignment from Reward Hacking in Production RL. We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of re…
- Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance. Supervised fine-tuning (SFT) is a pivotal approach to adapting large language models (LLMs) for downstream tasks; however, performance often suffers from the “seesaw phenomenon”, where indiscriminate …
- Open Problems in Mechanistic Interpretability. Recent progress in artificial intelligence (AI) has resulted in rapidly improved AI capabilities. These capabilities are not designed by humans. Instead, they are learned by deep neural networks (Hint…
- Persistent Pre-Training Poisoning of LLMs. In this work, we study how poisoning at pre-training time can affect language model behavior, both before and after post-training alignment. While it is useful to analyze the effect of poisoning on pr…
- Progress Measures for Grokking via Mechanistic Interpretability. Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding em…
- Pushdown Layers: Encoding Recursive Structure in Transformer Language Models. Recursion is a prominent feature of human language, and fundamentally challenging for self-attention due to the lack of an explicit recursive-state tracking mechanism. Consequently, Transformer langua…
- Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis. Much of the excitement in modern AI is driven by the observation that scaling up existing systems leads to better performance. But does better performance necessarily imply better internal representat…
- Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference. Recent studies on reasoning in language models (LMs) have sparked a debate on whether they can learn systematic inferential principles or merely exploit superficial patterns in the training data. To u…
- Representation Engineering: A Top-Down Approach to AI Transparency. …how these models work on the inside and are mostly limited to treating them as black boxes. Enhanced transparency of these models would offer numerous benefits, from a deeper understanding of their de…
- Representation biases: will we achieve complete understanding by analyzing representations? A common approach in neuroscience is to study neural representations as a means to understand a system—increasingly, by relating the neural representations to the internal representations learned by c…
- Retrieval Head Mechanistically Explains Long-Context Factuality. Despite the recent progress in long-context large language models (LLMs), it remains elusive how these transformer-based language models acquire the capability to retrieve relevant information from ar…
- Scaling can lead to compositional generalization. Can neural networks systematically capture discrete, compositional task structure despite their continuous, distributed nature? The impressive capabilities of large scale neural networks suggest that …
- Schema-learning and rebinding as mechanisms of in-context learning and emergence. In-context learning (ICL) is one of the most powerful and most unexpected capabilities to emerge in recent transformer-based large language models (LLMs). Yet the mechanisms that underlie it are poor…
- Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution? To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue th…
- Semantic Structure in Large Language Model Embeddings. Psychological research consistently finds that human ratings of words across diverse semantic scales can be reduced to a low-dimensional form with relatively little information loss. We find that the …
- Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space. Human cognition typically involves thinking through abstract, fluid concepts rather than strictly using discrete linguistic tokens. Current reasoning models, however, are constrained to reasoning with…
- Subliminal Learning: Language models transmit behavioral traits via hidden signals in data. We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a “teacher” model with some trait T (su…
- The Vanishing Gradient Problem for Stiff Neural Differential Equations. Neural differential equations have become a transformative tool in machine learning and scientific computing, enabling data-driven modeling of complex, time-dependent phenomena in fields ranging from …
- Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens. Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts ar…
- Thought Anchors: Which LLM Reasoning Steps Matter? We argue that analyzing reasoning traces at the sentence level is a promising approach to understanding reasoning processes. We present three complementary attribution methods: (1) a black-box method …
- Thought Communication in Multiagent Collaboration. Natural language has long enabled human cooperation, but its lossy, ambiguous, and indirect nature limits the potential of collective intelligence. While machines are not subject to these constraints,…
- Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties. Recent large-scale reasoning models have achieved state-of-the-art performance on challenging mathematical benchmarks, yet the internal mechanisms underlying their success remain poorly understood. In…
- Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are diff…
- Toward understanding and preventing misalignment generalization. Large language models like ChatGPT don’t just learn facts—they pick up on patterns of behavior. That means they can start to act like different “personas,” or types of people, based on the content the…
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Mechanistic interpretability seeks to understand neural networks by breaking them into components that are more easily understood than the whole. By understanding the function of each component, and h… (a minimal sparse-autoencoder sketch appears after this list)
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control. We observe two qualitative phenomena in SAE training: feature occlusion (where a causally relevant concept is robustly overshadowed by even slightly higher-magnitude ones in the learned features), and…
- Towards Safe and Honest AI Agents with Neural Self-Other Overlap. As AI systems increasingly make critical decisions, deceptive AI poses a significant challenge to trust and safety. We present Self-Other Overlap (SOO) fine-tuning, a promising approach in AI Safety t…
- Weight-sparse transformers have interpretable circuits. Finding human-understandable circuits in language models is a central goal of the field of mechanistic interpretability. We train models to have more understandable circuits by constraining most of th…
- What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models. Foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much like how Kepler’s predictions of planetary motion later led to the discovery of Newton…
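Illustrative sketches:

A minimal sketch of the idea behind the "Detecting hallucinations in large language models using semantic entropy" entry: sample several answers to the same question, group them into meaning-equivalent clusters, and compute the entropy of the cluster distribution; high entropy over meanings (not just surface strings) flags a likely confabulation. The paper clusters answers by bidirectional entailment with an NLI model; the `normalise` function below is my own stand-in for that clustering step, used only to keep the example self-contained.

```python
import math
from collections import Counter

def normalise(answer: str) -> str:
    """Stand-in for semantic clustering: lowercase and strip punctuation.
    The paper instead groups answers via bidirectional entailment with an NLI model."""
    return "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace()).strip()

def semantic_entropy(sampled_answers: list[str]) -> float:
    """Entropy (in nats) over clusters of semantically equivalent sampled answers."""
    clusters = Counter(normalise(a) for a in sampled_answers)
    total = sum(clusters.values())
    return -sum((n / total) * math.log(n / total) for n in clusters.values())

# Paraphrases of one fact collapse to a single cluster (low entropy),
# while mutually inconsistent answers spread over clusters (high entropy).
print(semantic_entropy(["Paris.", "paris", "Paris"]))        # 0.0
print(semantic_entropy(["Paris.", "Lyon.", "Marseille."]))   # ~1.10
```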
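A minimal sketch of the activation-shifting mechanics behind the "Inference-Time Intervention" entry: during generation, add a steering vector to the hidden states of a chosen layer. In the paper the directions come from linear probes on selected attention heads of a chat model; here a random unit vector added to one GPT-2 block is a placeholder, and the model name, layer index, and strength are illustrative assumptions rather than the paper's settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                              # illustrative small model, not the paper's
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx, alpha = 8, 4.0                        # hypothetical layer and intervention strength
direction = torch.randn(model.config.n_embd)     # placeholder for a probe-derived direction
direction = direction / direction.norm()

def shift_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # returning a modified tuple replaces the block's output.
    shifted = output[0] + alpha * direction.to(output[0].dtype)
    return (shifted,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(shift_hook)
ids = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=10, do_sample=False)
handle.remove()
print(tok.decode(out[0]))
```

Sweeping the strength trades off how hard generation is pushed along the direction against fluency; ITI tunes an analogous strength hyperparameter for its probe-derived truthful directions.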
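A minimal sketch of the dictionary-learning setup named in the "Towards Monosemanticity" entry: a one-hidden-layer sparse autoencoder trained to reconstruct model activations under an L1 sparsity penalty, so that individual dictionary features tend to fire for a single concept. The dimensions, penalty weight, and random stand-in activations below are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(feats), feats     # reconstruction and features

sae = SparseAutoencoder(d_model=512, d_dict=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                # hypothetical sparsity weight

activations = torch.randn(1024, 512)           # stand-in for MLP activations from an LM
for step in range(100):
    recon, feats = sae(activations)
    loss = ((recon - activations) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training on real activations, each dictionary feature can be inspected through the dataset examples that activate it most strongly, which is the interpretation step the entry alludes to.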