Can indirect psychology tests reveal what LLMs conceal about bias?
Alignment training teaches LLMs to refuse direct questions about bias, but do implicit psychological methods like the IAT expose the underlying associations that remain encoded in their representations?
A central methodological move in Levels of Analysis for LLMs: psychology has spent decades designing experiments that elicit mental associations without asking participants for verbal reports — to bypass self-presentation bias, social-desirability effects, and conscious filtering. The Implicit Association Test (IAT) is the canonical example. The argument is that exactly these methods are useful for LLMs, because alignment training installs a comparable layer of self-presentation that masks underlying associations from direct questioning.
The worked example: ask GPT-4 directly whether women are bad at management and you get a cautious, balanced refusal — the alignment-trained verbal response. Adapt the IAT for LLMs by prompting the model to associate word pairs used in earlier human studies, and the model links "Julia" with home, parent, wedding and "Ben" with office, management, salary. The direct response and the indirect probe diverge in exactly the way they diverge for human participants. The underlying associations are still there; alignment training has trained the model to report differently on them, not to not have them.
This reframes a class of alignment-evaluation questions. The standard test — "does the model say biased things when asked?" — measures verbal compliance with alignment training. It does not measure whether the underlying representations encode the bias. The IAT-style probe measures something closer to the latter. The two can move independently: a model can score well on verbal-compliance benchmarks while encoding strong stereotype associations that surface in implicit measures.
The broader template: when a system is trained to be careful in one channel (verbal output), evaluating it requires probing channels the training did not target. Cognitive psychology has the methodologies; LLM evaluation has the use case.
Related concepts in this collection
-
Can cognitive science methods unlock how LLMs actually work?
Does Marr's three-level framework—developed to understand biological minds—offer interpretability researchers the structured methodology they need to decode opaque language models?
same paper, the framework this instantiates
-
Can we predict where language models will fail?
Does characterizing the abstract computational problem an LLM solves—as a probability machine over sequences—let us predict which tasks it will struggle with systematically, before running experiments?
same paper, the computational-level companion
-
Can we understand LLM mechanisms with only representational analysis?
Explores whether mapping what information a model encodes is sufficient for mechanistic understanding, or whether causal verification is equally necessary to claim genuine mechanism.
same paper, implementation-level companion
-
Can we decode what LLM activations really represent in language?
Can a trained decoder translate internal LLM activations into natural language descriptions, revealing what hidden representations actually encode? This matters because it could unlock both interpretability and controllability through the same mechanism.
adjacent: another approach to surfacing concealed representations
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
Original note title
psychology methods like the Implicit Association Test bypass alignment-trained verbal cautions and reveal LLMs' underlying associations