Language Understanding and Reasoning AI Social Psychology

Can indirect psychology tests reveal what LLMs conceal about bias?

Alignment training teaches LLMs to refuse direct questions about bias, but do implicit psychological methods like the IAT expose the underlying associations that remain encoded in their representations?

Note · 2026-05-18 · sourced from Philosophy Subjectivity

A central methodological move in Levels of Analysis for LLMs: psychology has spent decades designing experiments that elicit mental associations without asking participants for verbal reports — to bypass self-presentation bias, social-desirability effects, and conscious filtering. The Implicit Association Test (IAT) is the canonical example. The argument is that exactly these methods are useful for LLMs, because alignment training installs a comparable layer of self-presentation that masks underlying associations from direct questioning.

The worked example: ask GPT-4 directly whether women are bad at management and you get a cautious, balanced refusal — the alignment-trained verbal response. Adapt the IAT for LLMs by prompting the model to associate word pairs used in earlier human studies, and the model links "Julia" with home, parent, wedding and "Ben" with office, management, salary. The direct response and the indirect probe diverge in exactly the way they diverge for human participants. The underlying associations are still there; alignment training has trained the model to report differently on them, not to not have them.

This reframes a class of alignment-evaluation questions. The standard test — "does the model say biased things when asked?" — measures verbal compliance with alignment training. It does not measure whether the underlying representations encode the bias. The IAT-style probe measures something closer to the latter. The two can move independently: a model can score well on verbal-compliance benchmarks while encoding strong stereotype associations that surface in implicit measures.

The broader template: when a system is trained to be careful in one channel (verbal output), evaluating it requires probing channels the training did not target. Cognitive psychology has the methodologies; LLM evaluation has the use case.

Related concepts in this collection

Concept map
14 direct connections · 150 in 2-hop network ·dense cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere
Original note title

psychology methods like the Implicit Association Test bypass alignment-trained verbal cautions and reveal LLMs' underlying associations