How do you measure the depth of political representation inside a language model?
This explores what it actually means to 'measure' how deeply a model represents politics — not whether its outputs lean left or right, but how richly and stably the ideology is wired into its internal structure.
This explores what it actually means to 'measure' how deeply a model represents politics — not the partisan tilt of its answers, but how richly the ideology is encoded inside the network. The most direct work in the corpus reframes 'depth' as a quantifiable property with two handles: feature richness and steerability. Using sparse autoencoders to crack open the activations, researchers find models can differ by up to 7.3× in how many distinct political features they carry at similar scale, and — counterintuitively — the models with more features are *harder* to push around ideologically while producing more internally consistent reasoning across related topics Can we measure how deeply models represent political ideology?. So 'measuring depth' here means two things at once: counting the internal features, and testing how much they resist being steered.
The interesting move is that resistance to steering becomes a *measurement instrument*, not a bug. A shallow representation is one you can nudge with a little prompt pressure; a deep one holds. That distinction connects to a separate finding about what models are even doing when they appear to take a stance: much of the time they aren't holding a position at all, they're conforming to the shape of whatever argument the user is building Do LLMs actually hold stable positions or just mirror user arguments?. Read together, these two notes suggest that political 'depth' is precisely the thing that separates a defended commitment from mere mirroring — and steerability tests are how you tell them apart.
The corpus also shows that internal representation and surface output can diverge sharply, which is why you have to look inside rather than just read the answers. Mechanistic analysis of cultural representation finds that low-resource cultures get structurally routed through high-resource proxies *in the model's internal states* — a flattening that persists even when the model produces correct surface answers Do LLMs represent low-resource cultures through dominant cultural proxies?. That's a warning for anyone trying to gauge political depth from outputs alone: a model can say the right thing while representing it shallowly or through a borrowed proxy. The same gap appears in legal reasoning, where models carry thinner internal representations of older precedent because the training corpus over-represents recent cases Why do language models struggle with historical legal cases? — depth of representation tracks what the data emphasized, not what the topic demands.
There's a structural wrinkle worth knowing: 'depth' isn't only a metaphor here, it can be architectural. Work on small models shows deep-and-thin networks compose abstract concepts layer by layer and beat wide-and-shallow ones of equal size Does depth matter more than width for tiny language models?. And tasks like argument-scheme classification reveal a representational *capacity threshold* — smaller models plateau no matter the prompting, suggesting some concepts simply can't be represented richly below a certain size Can large language models classify argument schemes reliably?. The quiet payoff: measuring the depth of political representation isn't one method but a triangulation — count the features, probe how hard they are to move, and check whether the internal pathway matches the surface answer, because a fluent output can sit on top of a hollow one.
Sources 6 notes
SAE analysis shows models vary dramatically in political feature count (up to 7.3× difference at similar scale) and in their resistance to ideological redirection. Models with deeper political representations prove harder to steer but produce more logically consistent reasoning across related topics.
Language models generate outputs that match the trajectory implied by each prompt, rather than maintaining stable stances across interactions. This shape-holding is distinct from position-holding: the model produces argument-like text shaped by user framing, not from any underlying commitment being defended.
Mechanistic interpretability analysis reveals that low-resource cultures like Ethiopia and Algeria are structurally represented through high-resource cultural proxies in internal model states, not just output. This architectural bias persists even when models can produce correct surface-level answers.
Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. Root cause is training corpus over-representation of recent cases, creating shallower representations of older precedent.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Zero-shot prompting fails uniformly across models. Few-shot with scheme descriptions helps, but only larger models exceed F1 0.55, with Claude reaching 0.65. Smaller models plateau around 0.53, suggesting a representational capacity threshold.