AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders
Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AXBENCH, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform best. On both evaluations, SAEs are not competitive.
Interpretability researchers have thus proposed a new class of representation-based interventions for steering LMs, which aim to address these issues. These methods include learning steering vectors from small labelled datasets and training self-supervised sparse autoencoders (SAEs), among other techniques. Since steering may enable lightweight and interpretable control over model outputs, it has emerged as a potential alternative to finetuning and prompting (see §2).
Related work
Representation-based control. Interventional/causal interpretability has emerged as the dominant paradigm for understanding neural networks in the LLM era, enabling the reverse-engineering of circuits underlying specific behaviours (Giulianelli et al., 2018; Vig et al., 2020; Geiger et al., 2021; 2022; Meng et al., 2022; Chan et al., 2022; Wang et al., 2023; Goldowsky-Dill et al., 2023; Geiger et al., 2024; Guerner et al., 2024). An important assumption in much of this work is the linear representation hypothesis, which claims that linear subspaces of representations in neural networks encode concepts (Mikolov et al., 2013b; Pennington et al., 2014; Bolukbasi et al., 2016; Elhage et al., 2022; Park et al., 2023; Nanda et al., 2023). Intervening on representations has thus emerged as an alternative to finetuning and prompting for LM control.
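To make the linear representation hypothesis concrete, a concept direction can be estimated by difference-in-means, i.e. as the difference between the mean activations of inputs that do and do not express a concept. The sketch below is a minimal illustration under the assumption that activations have already been collected as tensors; the function names and tensor shapes are hypothetical placeholders, not taken from any cited method.

```python
import torch

def difference_in_means(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a concept direction from activations (hypothetical helper, for illustration only).

    pos_acts: (n_pos, d_model) hidden states from inputs that express the concept.
    neg_acts: (n_neg, d_model) hidden states from inputs that do not.
    Returns a unit-norm direction in activation space.
    """
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return direction / direction.norm()

def concept_score(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Score how strongly each activation expresses the concept (dot product with the direction)."""
    return acts @ direction
```

Under the linear representation hypothesis, such a direction can be used both for concept detection (scoring activations) and, as discussed next, as a steering direction.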
Representation-based steering, in which fixed vectors are added to activations or activations are clamped to a certain value along fixed directions, is one such intervention-based tool for model control (Zou et al., 2023; Li et al., 2024; Turner et al., 2024; Marks and Tegmark, 2024; Liu et al., 2024; van der Weij et al., 2024; Rimsky et al., 2024). Finetuning-based approaches such as ReFT (Wu et al., 2024a) enable optimisation of steering directions on a dataset. Steering vectors need not be computed from labelled data; SAEs enable scalable discovery of steering vectors from unlabelled data. In the same class of approaches, latent adversarial training (Casper et al., 2024) and circuit breakers (Zou et al., 2024) are representation-based control methods that increase the adversarial robustness of LLMs.
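As a concrete illustration of steering by adding a fixed vector to activations, the sketch below registers a forward hook on one decoder layer of Gemma-2-2B and shifts its residual-stream output along a chosen direction during generation. This is a schematic example, not the implementation of any cited method; the layer index, steering strength, and random direction are hypothetical placeholders, and the module layout assumes the Hugging Face transformers implementation of Gemma-2.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"   # one of the models evaluated in AxBench
layer_idx, alpha = 12, 8.0         # hypothetical layer and steering strength
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A unit-norm direction in activation space; in practice this would come from
# e.g. difference-in-means, a supervised steering method, or an SAE decoder column.
steering_vector = torch.randn(model.config.hidden_size)
steering_vector /= steering_vector.norm()

def add_steering(module, inputs, output):
    # The decoder layer returns a tuple whose first element is the residual-stream hidden states.
    hidden = output[0] + alpha * steering_vector.to(output[0].device, output[0].dtype)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
ids = tok("The weather today is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```

Clamping-style interventions differ only in that the activation's projection onto the direction is set to a fixed value rather than shifted by a constant.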
Sparse autoencoders. Sparse autoencoders (SAEs) aim to enable self-supervised and thus scalable decomposition of the representation space into meaningful concepts (Templeton et al., 2024; Chalnev et al., 2024; Makelov, 2024; O’Brien et al., 2024; Gao et al., 2024). SAEs are trained to reconstruct LLM hidden representations in a higher-dimensional latent space with a sparsity penalty, based on the assumption that concepts must be represented sparsely in order to prevent interference. The latents are then labelled with natural-language descriptions using automatic interpretability pipelines (e.g. Juang et al., 2024), and these descriptions can be used to identify latents that are useful for steering the LM.
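The sketch below shows the generic recipe described above: an overcomplete encoder/decoder trained to reconstruct hidden representations under an L1 sparsity penalty. It is a schematic illustration, not the architecture of any particular released SAE (variants differ in activation functions, normalisation, and the form of the sparsity penalty), and the hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Schematic SAE: reconstruct d_model-dim activations through a wider, sparse latent space."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)   # d_latent >> d_model (overcomplete)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, acts: torch.Tensor):
        latents = torch.relu(self.encoder(acts))      # sparse, non-negative latent activations
        recon = self.decoder(latents)
        return recon, latents

def sae_loss(recon, acts, latents, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages only a few latents to fire per input.
    return ((recon - acts) ** 2).mean() + l1_coeff * latents.abs().sum(dim=-1).mean()
```

Once a latent has been assigned a natural-language label, the corresponding decoder column can serve as a candidate steering direction.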
Recent work reports mixed results when evaluating SAEs for steering; SAEs (but also several other steering methods) suffer from a tradeoff between model control and capabilities preservation (Mayne et al., 2024; Chalnev et al., 2024; Durmus et al., 2024; Bhalla et al., 2025). However, Karvonen et al. (2024) report Pareto-optimal performance when using SAEs to prevent models from producing regular expressions in code. Overall, evaluating SAEs remains an open problem because there is no ground-truth set of features to compare against.