Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
A central problem in this area is how to disentangle internal model representations into meaningful concepts or features. If successful at scale, this research could provide significant scientific and practical value, enabling enhanced model robustness, controllability, interpretability, and debugging. A key obstacle is that we cannot directly evaluate the usefulness of the features learned by an SAE, as we do not know the hypothetical ‘true’ features to begin with; indeed, finding them is the reason we use SAEs in the first place. In addition, we observe two qualitative phenomena in SAE training: feature occlusion (where a causally relevant concept is robustly overshadowed in the learned features by even slightly higher-magnitude ones) and feature over-splitting (where a binary feature splits into many smaller, less interpretable features).
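To make the setting concrete, the following is a minimal sketch of a sparse autoencoder of the kind referred to above: it reconstructs model activations through an overcomplete hidden layer whose units serve as candidate features, with an L1 penalty encouraging sparse feature use. The dimensions, L1 coefficient, and training loop are illustrative assumptions, not the exact configuration used in this work.

```python
# Minimal sparse autoencoder (SAE) sketch: reconstruct activations under an L1 sparsity penalty.
# All hyperparameters and the synthetic data below are placeholder assumptions.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # Feature activations: one candidate (hopefully interpretable) feature per hidden unit.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f


def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 term that pushes most feature activations toward zero.
    recon = (x - x_hat).pow(2).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity


# Example: fit the SAE to a batch of placeholder activations standing in for model internals.
activations = torch.randn(1024, 512)  # [n_examples, d_model]
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

for _ in range(100):
    x_hat, f = sae(activations)
    loss = sae_loss(activations, x_hat, f)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The hidden layer is deliberately much wider than the activation dimension, so each input is explained by a small number of active units; it is the rows of the decoder (and the corresponding activations) that one hopes correspond to meaningful features, which is precisely what the evaluation problem above concerns.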