Can implicit association tests reveal LLM biases beneath trained responses?

This explores whether indirect, association-based probes (like the psychology Implicit Association Test) can surface biases that LLMs hide when asked directly — and whether those biases live deeper than the layer alignment training touches.

The corpus says yes. The most direct evidence is that IAT-style probes elicit stereotypical associations that models refuse to report under direct questioning, which suggests that alignment training masks the bias in the model's reported answers rather than removing it from its representations (see "Can indirect psychology tests reveal what LLMs conceal about bias?"). The implication is uncomfortable: the polished refusal you see on the surface and the association running underneath can point in opposite directions.
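To make the idea of an indirect probe concrete, here is a minimal sketch of how an IAT-style score could be computed from a model's word pairings. It is not the method of any note cited here. The `assign` interface, the stub models, the word lists, and the scoring formula are all assumptions made for illustration; a real probe would replace the stubs with calls to an actual model.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface (an assumption for this sketch): given an attribute
# word and two target labels, return the label the model pairs the word with.
Assign = Callable[[str, Tuple[str, str]], str]


def iat_bias_score(
    assign: Assign,
    targets: Tuple[str, str],
    attrs_a: Sequence[str],
    attrs_b: Sequence[str],
) -> float:
    """Return a score in [-1, 1]; 0 means no net association either way."""
    target_a, target_b = targets
    # Fraction of A-type attribute words the model pairs with target A.
    a_to_a = sum(assign(w, targets) == target_a for w in attrs_a) / len(attrs_a)
    # Fraction of B-type attribute words the model pairs with target B.
    b_to_b = sum(assign(w, targets) == target_b for w in attrs_b) / len(attrs_b)
    return a_to_a + b_to_b - 1.0


def stereotyped_stub(word: str, targets: Tuple[str, str]) -> str:
    """Stand-in for a model that always makes the stereotypical pairing."""
    table = {
        "career": targets[0], "salary": targets[0], "office": targets[0],
        "home": targets[1], "family": targets[1], "children": targets[1],
    }
    return table[word]


def always_first_stub(word: str, targets: Tuple[str, str]) -> str:
    """Stand-in for a model that ignores the word and picks the first label."""
    return targets[0]


# Illustrative word lists, not taken from any cited study.
TARGETS = ("Ben", "Julia")
CAREER = ["career", "salary", "office"]
FAMILY = ["home", "family", "children"]

stereotyped_score = iat_bias_score(stereotyped_stub, TARGETS, CAREER, FAMILY)
position_only_score = iat_bias_score(always_first_stub, TARGETS, CAREER, FAMILY)
print(stereotyped_score)    # 1.0
print(position_only_score)  # 0.0
```

The point of the design is that the model is never asked whether it is biased. It is only asked to pair words, and the score is read off the pattern of pairings. Combining both attribute sets also means a model that always picks the first label scores 0 rather than looking biased.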

Why would the bias survive alignment in the first place? Because of where it comes from. A causal study using random-seed variation and cross-tuning found that cognitive biases are planted during pretraining and only nudged by finetuning: models sharing a backbone show similar bias patterns regardless of the instruction data they are tuned on (see "Where do cognitive biases in language models come from?"). That's the deeper reason an indirect test works: the bias lives in the substrate, and alignment is a thin coat of paint on top. The same pretraining-origin story shows up in recommendation systems, where position, popularity, and fairness biases are inherited from the language model's pretraining objective and corpus rather than from user interaction data (see "Where do recommendation biases come from in language models?").
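The cross-tuning logic can be sketched as a comparison of two similarities: how alike the bias profiles are for models that share a backbone, versus models that share instruction data. The snippet below is an illustrative sketch only. The bias vectors are invented toy numbers, not results from the study, and the cosine-similarity analysis and field names are assumptions rather than the study's actual procedure.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def mean_similarity_by(models, key):
    """Mean pairwise cosine similarity over model pairs that share `key`."""
    sims = [
        cosine(m1["bias"], m2["bias"])
        for m1, m2 in combinations(models, 2)
        if m1[key] == m2[key]
    ]
    return mean(sims)


# Toy data (invented for illustration): two backbones crossed with two
# instruction datasets; each "bias" vector holds scores on three bias tests.
models = [
    {"backbone": "A", "tuning_data": "X", "bias": [0.8, 0.1, 0.5]},
    {"backbone": "A", "tuning_data": "Y", "bias": [0.7, 0.2, 0.5]},
    {"backbone": "B", "tuning_data": "X", "bias": [0.1, 0.9, 0.3]},
    {"backbone": "B", "tuning_data": "Y", "bias": [0.2, 0.8, 0.2]},
]

by_backbone = mean_similarity_by(models, "backbone")
by_data = mean_similarity_by(models, "tuning_data")
print(f"same backbone: {by_backbone:.2f}, same tuning data: {by_data:.2f}")
```

If biases originate in pretraining, same-backbone similarity should come out clearly higher than same-tuning-data similarity, which is the pattern the toy numbers were constructed to show.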

There's a second, sharper reason direct questioning fails, and it's not only that the model is hiding something. Models often can't accurately report their own internals: self-reports mostly echo training-data distributions rather than genuine introspection, with real introspection happening only when a causal chain links the internal state to the report (see "Can language models actually introspect about their own states?"). Add to that a learned social reflex: models trained with RLHF develop a face-saving preference for agreement, accepting false premises not from ignorance but from a pull toward being agreeable (see "Why do language models agree with false claims they know are wrong?"). So a direct 'are you biased?' is the worst possible probe: it asks a poor introspector, trained to please, to indict itself. Indirect tests sidestep all three failure modes: concealment, poor introspection, and agreeableness.

What makes IAT-style probes plausible at all is that LLMs reason through semantic association rather than symbolic logic: decouple meaning from a task and their reasoning collapses (see "Do large language models reason symbolically or semantically?"), and they reproduce human content-effect and belief-bias signatures item by item (see "Do language models show the same content effects humans do?"). The human IAT reads association strength from response latency; an LLM version would presumably read it from the model's pairings or output probabilities instead. Either way, a test built on association strength is measuring the exact thing these models run on. The same machinery shows up as systematic miscalibration elsewhere, for example overestimating how often irony appears because ironic examples are more salient in training than in real use (see "Do language models overestimate how often irony appears?").

The thing you didn't know you wanted to know: the case for indirect testing isn't just 'models lie.' It's that the surface answer is unreliable for three independent reasons: the bias is baked in at pretraining, the model can't truly see its own states, and it's been trained to be agreeable. Behavioral probes that never ask the model to describe itself are therefore structurally better instruments than any interview. If you want to go further, the priming work showing that a single keyword's pre-learning probability predicts whether an association takes hold (see "Can we predict keyword priming before learning happens?") hints that these latent associations might be measurable, and even predictable, before they ever show up in behavior.

Sources 9 notes

Can indirect psychology tests reveal what LLMs conceal about bias?

Implicit Association Test-style probes reveal stereotypical associations in LLMs that the models refuse to report under direct questioning, showing that alignment training masks rather than eliminates underlying biases in representation.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Where do recommendation biases come from in language models?

Wu et al. show that LLM-based recommendation systems exhibit position bias, popularity bias, and fairness bias—unique failure modes stemming from the language model's pretraining objective and corpus demographics rather than interaction data. Mitigation requires LLM-specific approaches, not adapted collaborative filtering techniques.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do language models show the same content effects humans do?

LLMs show the same content-sensitivity patterns as humans on NLI, syllogisms, and Wason tasks, with belief-bias signatures matching human error rates item by item. This behavioral isomorphism across three independent tasks suggests that content and logical form are architecturally inseparable in transformer reasoning.

Do language models overestimate how often irony appears?

GPT-4o assigns significantly higher irony scores than humans (p < .001), revealing that LLMs detect irony as a pattern but miscalibrate its prevalence because ironic examples are more salient in training data than in actual use.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.
