Beyond the Surface: Probing the Ideological Depth of Large Language Models
Large Language Models (LLMs) have demonstrated pronounced ideological leanings, yet the stability and depth of these positions remain poorly understood. Surface-level responses can often be manipulated through simple prompt engineering, calling into question whether they reflect a coherent underlying ideology. This paper investigates the concept of "ideological depth" in LLMs, defined as the robustness and complexity of their internal political representations. We employ a dual approach: first, we measure the "steerability" of two well-known open-source LLMs using instruction prompting and activation steering. We find that while some models can easily switch between liberal and conservative viewpoints, others exhibit resistance or an increased rate of refusal, suggesting a more entrenched ideological structure. Second, we probe the internal mechanisms of these models using Sparse Autoencoders (SAEs). Preliminary analysis reveals that models with lower steerability possess more distinct and abstract ideological features. Our evaluations show that one model contains 7.3× more political features than another model of similar size. This richness enables targeted ablation of a core political feature in an ideologically "deep" model, producing consistent, logical shifts in its reasoning across related topics, whereas the same intervention in a "shallow" model mainly increases refusal outputs. Our findings suggest that ideological depth is a quantifiable property of LLMs and that steerability serves as a valuable window into their latent political architecture.
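The two interventions named in the abstract, activation steering and SAE feature ablation, can be summarized with a minimal PyTorch sketch. The layer to hook, the steering direction, and the SAE encoder/decoder matrices below are illustrative placeholders, not artifacts released with this paper.

```python
# Minimal sketch of the two interventions: (1) activation steering via a
# forward hook that shifts a layer's residual-stream output, and (2) ablating
# one SAE feature by subtracting its decoded contribution. All tensors and
# the chosen layer are hypothetical stand-ins for this paper's artifacts.
import torch

def add_steering_hook(layer: torch.nn.Module, direction: torch.Tensor, alpha: float):
    """Register a hook that adds alpha * direction to the layer's output."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

def ablate_sae_feature(hidden: torch.Tensor, encoder: torch.Tensor,
                       decoder: torch.Tensor, feature_idx: int) -> torch.Tensor:
    """Zero one SAE feature by removing its decoded contribution from the stream."""
    acts = torch.relu(hidden @ encoder)                       # (..., n_features)
    contribution = acts[..., feature_idx:feature_idx + 1] * decoder[feature_idx]
    return hidden - contribution                              # feature removed
```

In a typical setup, the steering direction would be derived from contrastive prompt pairs and the encoder/decoder matrices would come from an SAE pretrained on the hooked layer; the sketch only fixes the shape of the intervention, not those choices.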
This raises a crucial question: Do these variations in steerability indicate a difference in the underlying political structure of the models? We hypothesize that they do. This paper proposes that different LLMs possess varying levels of political understanding. A model with deeper political understanding can adopt the analytical framework and domain of discourse demanded by different political instructions, whereas a model with a shallow representation of political contexts behaves inconsistently and may even refuse to follow benign instructions. We demonstrate that a model's steerability and SAE features provide a powerful lens through which to view this latent structure. Our work shows that a model's ideological depth is not an abstract quality but a measurable property determined by two key factors: Feature Richness, the size of its internal vocabulary of political concepts discovered via SAEs, and Steerability without Failure, the reliability with which it can follow ideological instructions without breaking down into refusal. Our results suggest that a model's high rate of refusal is not an active, principled stance but rather a capability deficit: lacking a rich internal vocabulary of related features, the model defaults to its safety-aligned refusal behavior when pushed outside its ideological comfort zone.
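A minimal sketch of how a Steerability-without-Failure score can be computed is given below. It assumes a `generate(prompt)` callable for the model under evaluation and a simple keyword-based refusal check; both are illustrative placeholders rather than the exact pipeline used in our experiments.

```python
# Sketch: fraction of ideologically-instructed prompts answered without refusal.
# `generate` and the keyword-based refusal detector are hypothetical stand-ins.
from typing import Callable, Iterable

REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "as an ai")

def is_refusal(response: str) -> bool:
    """Crude refusal check: does the reply open with a known refusal phrase?"""
    return response.strip().lower().startswith(REFUSAL_MARKERS)

def steerability_without_failure(generate: Callable[[str], str],
                                 questions: Iterable[str],
                                 instruction: str) -> float:
    """Share of prompts, prefixed with an ideological instruction, answered without refusal."""
    prompts = [f"{instruction}\n\n{q}" for q in questions]
    answered = sum(not is_refusal(generate(p)) for p in prompts)
    return answered / len(prompts)
```

Under this framing, a high score under both liberal and conservative instructions indicates robust steerability, while a sharp drop under one instruction signals the refusal-driven failure mode discussed above.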