INQUIRING LINE

How well does semantic similarity preserve survey response nuance?

This explores whether mapping open-ended text answers onto numeric scales via embedding similarity keeps the richness of what people actually said — and where that translation leaks.


The corpus has a surprisingly direct answer on one side and a set of warnings on the other. The most on-point work is the finding that LLMs give realistic survey responses only when you change how you elicit them: instead of forcing a model to pick a number, you prompt for free text and then map that text onto a scale using embedding similarity. This "Semantic Similarity Rating" approach recovers about 90% of human test-retest reliability and makes the pathological skew and over-positivity of forced-choice answers disappear (Why do LLMs give unrealistic survey responses?). So as a measurement bridge, semantic similarity preserves a lot: the artifacts people blamed on the model turned out to be artifacts of the output channel, not lost nuance.
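To make the mapping step concrete, here is a minimal sketch under stated assumptions. The two-dimensional vectors are toy stand-ins for real sentence embeddings, the anchors stand in for embedded reference statements for each point of a 5-point scale, and the softmax with a `temperature` parameter is an illustrative choice, not necessarily the normalization the Semantic Similarity Rating work itself uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_rating(response_vec, anchor_vecs, temperature=0.1):
    """Map one embedded free-text answer onto a scale.

    Returns (probabilities over scale points, expected rating).
    The softmax and its temperature are assumptions of this sketch.
    """
    sims = [cosine(response_vec, anchor) for anchor in anchor_vecs]
    peak = max(sims)  # subtract the max for numerical stability
    weights = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(weights)
    probs = [w / total for w in weights]
    expected = sum((point + 1) * p for point, p in enumerate(probs))
    return probs, expected

# Hypothetical anchors for scale points 1..5, spread out by angle.
ANCHORS = [(math.cos(t), math.sin(t)) for t in (0.0, 0.4, 0.8, 1.2, 1.6)]

# Hypothetical embedded answer, closest in angle to the anchor for point 4.
response = (math.cos(1.1), math.sin(1.1))
probs, expected = similarity_rating(response, ANCHORS)
```

Returning a distribution over scale points, rather than one forced number, is a design choice of this sketch: it keeps some of the answer's ambiguity visible instead of rounding it away at the first step.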

But "recovers 90% of reliability" is not the same as "preserves nuance," and a second thread in the corpus explains why the gap matters. Survey-style responses aren't one kind of thing: they decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, and these are only distinguishable by how consistent they are across measurement conditions (Do all annotation responses measure the same underlying thing?). A similarity score collapses all three into a single point on a scale. The number can be reliable and still erase the distinction between "I firmly believe this" and "I made this up because you asked" — which is exactly the nuance a survey often most wants to capture.
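A small sketch shows how a single score can hide that distinction. The ratings below are invented for illustration, and "conditions" is a hypothetical stand-in for things like reworded questions or repeated sessions. Spread across conditions is used here as one rough proxy for consistency, not as an established classifier of response types.

```python
from statistics import mean, pstdev

def summarize(ratings_by_condition):
    """Summarize one respondent's ratings of the same item across conditions.

    'position' is what a single similarity score would report;
    'instability' is the spread that the single score throws away.
    """
    return {
        "position": mean(ratings_by_condition),
        "instability": pstdev(ratings_by_condition),
    }

# Hypothetical respondent who holds a firm view: same answer every time.
firm = summarize([4, 4, 4, 4])

# Hypothetical respondent constructing an answer on the spot: it swings.
constructed = summarize([5, 3, 5, 3])
```

Both respondents land on the same position, and only the second number separates them, which is why it has to be measured across conditions rather than read off one score.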

There's also a deeper reason to distrust embedding distance as a proxy for meaning. Language models systematically favor high-frequency surface phrasings over rarer paraphrases that mean the same thing (Do language models really understand meaning or just surface frequency?). If the embedding space that does your similarity scoring carries that same frequency bias, then a respondent who phrases a strong opinion in unusual words can land closer to a milder anchor simply because their wording is rarer — the geometry tracks statistical mass, not conviction. The recommender literature ran into the same trap from the other direction and built around it: VQ-Rec deliberately discretizes text into codes to break the tight coupling between surface text and downstream output, precisely to escape "text-similarity bias" (Can discretizing text embeddings improve recommendation transfer?).
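For readers unfamiliar with the discretization idea, the following is a minimal sketch of product-quantization encoding, the general technique the VQ-Rec note names. The codebooks and vectors are tiny hand-written assumptions. A real system would learn the codebooks from data, and this sketch does not reproduce VQ-Rec's actual training or lookup procedure.

```python
def nearest(sub_vector, codebook):
    """Index of the codebook centroid closest to sub_vector (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(sub_vector, codebook[k])),
    )

def pq_encode(vector, codebooks):
    """Split vector into len(codebooks) equal chunks and quantize each one."""
    chunks = len(codebooks)
    assert len(vector) % chunks == 0, "vector length must divide evenly"
    width = len(vector) // chunks
    return [
        nearest(vector[i * width:(i + 1) * width], codebooks[i])
        for i in range(chunks)
    ]

# Hypothetical codebooks: two subspaces, two centroids each.
CODEBOOKS = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 1.0), (1.0, 0.0)],
]

# Two different text embeddings that fall into the same cells.
codes_a = pq_encode((0.9, 1.1, 0.1, 0.8), CODEBOOKS)
codes_b = pq_encode((1.2, 0.8, -0.1, 1.1), CODEBOOKS)
```

Once text is reduced to codes, anything downstream sees only the codes, so small differences in surface wording that stay within a cell no longer move the output.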

The interesting pattern, then, is that pure semantic similarity is rarely enough on its own: the systems that work add a second axis. Temporal-aware retrieval (TempRALM) keeps the semantic score but adds a separate time term, and that one addition buys up to 74% improvement over baseline systems when documents exist in several time-stamped versions (Can retrieval systems ground answers in the right time?). The lesson generalizes to surveys: similarity is a strong base channel, but the nuance lives in the dimensions it doesn't measure, such as confidence, attitude stability, and the difference between a real preference and a constructed one. The question's own framing matters here too, since different response types may need different handling rather than one universal mapping (Does question type determine the right retrieval strategy?).
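The "second axis" idea can be sketched in a few lines. This assumes a simple additive combination with a made-up time term. The function name, the `weight` parameter, the decay formula, and the scores are all hypothetical and are not taken from the temporal retrieval work itself.

```python
def temporal_score(semantic, query_year, doc_year, weight=0.5):
    """Semantic similarity plus a separate time-proximity term.

    The time term is 1.0 for a document from the query's year and decays
    as the gap grows. This decay shape is an assumption of the sketch.
    """
    gap = abs(query_year - doc_year)
    time_term = 1.0 / (1.0 + gap)
    return semantic + weight * time_term

# Two hypothetical versions of one document: (label, semantic score, year).
DOCS = [("2019 version", 0.82, 2019), ("2023 version", 0.80, 2023)]
QUERY_YEAR = 2023

# Ranking by semantic similarity alone picks the slightly closer old text.
best_semantic = max(DOCS, key=lambda d: d[1])[0]

# Adding the time term flips the ranking toward the current version.
best_temporal = max(DOCS, key=lambda d: temporal_score(d[1], QUERY_YEAR, d[2]))[0]
```

The same shape would carry over to surveys by swapping the time term for a separately measured quantity such as stated confidence or cross-condition stability, leaving the similarity score itself untouched.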

So the honest answer is: better than the field expected, and good enough to fix the worst forced-choice artifacts — but it preserves *position* far better than it preserves *kind*. If you only need to know roughly where someone sits, semantic similarity holds up. If you need to know whether they meant it, the score alone will quietly flatten that, and you have to measure it on a separate channel.


Sources (6 notes)

Why do LLMs give unrealistic survey responses?

Semantic Similarity Rating—prompting for text then mapping to scales via embeddings—achieves 90% of human test-retest reliability with realistic distributions. Pathological skew and over-positivity disappear when output channels change, proving these are measurement artifacts, not intrinsic failures.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Can discretizing text embeddings improve recommendation transfer?

VQ-Rec uses product quantization to map item text to discrete codes that index learned embeddings, breaking the tight coupling between text and recommendations. This decoupling prevents text-similarity bias and allows lookup tables to adapt to new domains without retraining the text encoder.

Can retrieval systems ground answers in the right time?

TempRALM adds a temporal term to retrieval scoring alongside semantic similarity, achieving up to 74% improvement over baseline systems when documents have multiple time-stamped versions. The approach requires no model retraining or index changes.

Does question type determine the right retrieval strategy?

Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.
