Why do LLMs give unrealistic survey responses?
Direct numerical elicitation from language models produces skewed, over-positive survey distributions. Is this a fundamental model limitation, or an artifact of how we ask the question?
Asking an LLM directly for a numerical rating produces unrealistic, skewed response distributions, the documented failure mode of synthetic-consumer panels. Semantic Similarity Rating (SSR) changes the elicitation, not the model: prompt for a free-text response, then map that text to a distribution over Likert points via its embedding similarity to a set of reference statements. On a dataset of 57 personal-care product surveys with 9,300 human responses, SSR reaches 90% of human test-retest reliability, produces realistic response distributions (KS similarity > 0.85), and yields rich qualitative rationales, all with no fine-tuning.
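The mapping step can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the published method: the anchor statements are invented, `embed` is a toy bag-of-words stand-in for a real sentence-embedding model, the subtract-the-minimum normalization is just one simple way to turn similarities into a distribution, and `ks_similarity` assumes the metric is one minus the largest gap between cumulative distributions.

```python
import math
import re
from collections import Counter

# Hypothetical reference statements, one per Likert point (1 to 5).
ANCHORS = [
    "I would definitely not buy this product",
    "I would probably not buy this product",
    "I am not sure whether I would buy this product",
    "I would probably buy this product",
    "I would definitely buy this product",
]


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence-embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ssr_distribution(response, anchors=ANCHORS, eps=1e-6):
    """Map a free-text response to a probability distribution over Likert points."""
    vec = embed(response)
    sims = [cosine(vec, embed(a)) for a in anchors]
    floor = min(sims)
    # Assumed normalization: shift so the least similar anchor gets ~0 mass.
    weights = [s - floor + eps for s in sims]
    total = sum(weights)
    return [w / total for w in weights]


def ks_similarity(p, q):
    """Assumed definition: 1 minus the max gap between the two cumulative distributions."""
    cum_p = cum_q = gap = 0.0
    for a, b in zip(p, q):
        cum_p += a
        cum_q += b
        gap = max(gap, abs(cum_p - cum_q))
    return 1.0 - gap


dist = ssr_distribution("I would definitely buy this product, it looks great")
print([round(x, 3) for x in dist])
```

With the toy embedding, lexical overlap drives the similarities, so "definitely not buy" scores close to "definitely buy". That is a limitation of the stand-in, and a reason the approach depends on a real semantic embedding model in practice. Averaging the per-response distributions across a synthetic panel gives an aggregate distribution that `ks_similarity` can compare against the human one.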
The key takeaway is diagnostic: the well-known pathologies of LLM-as-survey-respondent (skewed distributions, over-positivity, regression to the mean) are artifacts of how responses are elicited, not intrinsic limitations of the model. Shift from direct numerical elicitation to textual elicitation plus similarity mapping, and the artifacts largely dissolve. This relocates the problem from "LLMs can't simulate consumers" to "we were asking the question wrong."
This sharpens the persona-simulation cluster's central tension. Where "Can AI agents learn people better from interviews than surveys?" shows that fidelity rises with richer input, SSR shows that fidelity also rises with a better output elicitation channel; both are measurement-design wins. But the caution from "Can AI personas reliably replicate human experiment results?" still applies: high aggregate fidelity can coexist with unreliable fine-grained effects, so SSR's realism is a measurement improvement, not a guarantee of validity.
Related concepts in this collection 3
- Can AI agents learn people better from interviews than surveys?
  Can rich interview transcripts seed more accurate generative agents than demographic data or survey responses? This matters because it challenges how we build digital simulations of real people.
  Relation: richer input raises fidelity; SSR shows better output elicitation does too.
- Can AI personas reliably replicate human experiment results?
  Exploring whether LLM-based persona simulations accurately reproduce experimental findings from published psychology and marketing research, and what factors determine when they succeed or fail.
  Relation: caution, since aggregate realism can mask unreliable fine-grained effects.
- Can language models simulate belief change in people?
  Current LLM social simulators treat behavior as input-output mappings without modeling internal belief formation or revision. Can they be redesigned to actually track how people think and change their minds?
  Relation: SSR improves behavior-level simulation realism without addressing the thought-simulation critique.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge
- LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings
- Linguistic Calibration of Long-Form Generations
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
- Generalization Bias in Large Language Model Summarization of Scientific Research
- When Large Language Models contradict humans? Large Language Models’ Sycophantic Behaviour
- Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries
- Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution
Original note title
LLMs simulate human survey responses faithfully only when text is elicited and mapped to scales via embedding similarity — unrealistic numerical distributions are an elicitation artifact