Can vector embeddings measure task relevance instead of semantic similarity?
This explores whether the vectors we use to find 'similar' text can actually tell us what's *relevant to a task* — not just what's *semantically close* — and what people build when that distinction breaks.
The short version from the corpus is that embeddings measure the wrong thing by default, which is why so much work routes *around* raw embedding similarity rather than trusting it. The cleanest statement of the problem is that embeddings encode co-occurrence and semantic association, not task fit: two concepts can sit close together in vector space while playing completely different roles, so an underspecified query pulls back candidates that are 'related but wrong' (Do vector embeddings actually measure task relevance?). The meaning side reinforces this: static embeddings genuinely do carry rich lexical content like valence and concreteness (Do transformer static embeddings actually encode semantic meaning?), and their structure even mirrors taxonomy trees from coarse to fine (Do embedding eigenvectors organize taxonomy from coarse to fine?). So the issue isn't that embeddings are empty; it's that 'rich semantic association' and 'relevant to what I'm trying to do' are simply different axes.
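A minimal sketch of why co-occurrence produces "related but wrong" neighbours, assuming a hypothetical five-sentence corpus and raw context counts in place of a trained embedding model. The names `CORPUS`, `cooccurrence_vectors`, and `cosine` are illustrative only and do not come from any of the cited notes.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; real embeddings are trained on far more text.
CORPUS = [
    "increase the timeout value",
    "decrease the timeout value",
    "increase the retry limit",
    "decrease the retry limit",
    "restart the database server",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = cooccurrence_vectors(CORPUS)

# "increase" and "decrease" demand opposite actions but share every context.
sim_opposites = cosine(vectors["increase"], vectors["decrease"])
# "restart" shares only the word "the" with "increase".
sim_unrelated = cosine(vectors["increase"], vectors["restart"])

print(f"increase vs decrease: {sim_opposites:.3f}")
print(f"increase vs restart:  {sim_unrelated:.3f}")
```

In this toy setup the two opposite instructions have identical context vectors, so their cosine similarity is 1.0 up to floating-point rounding. A query for one would retrieve the other as its nearest neighbour even though it is exactly the wrong answer for the task.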
The interesting part is how the corpus answers the implied 'so what do we do instead?' Several lines converge on the same move: insert a task-aware decision *between* the query and the embedding lookup. StructRAG routes each query to a knowledge structure (table, graph, algorithm, catalogue, or plain chunks) chosen for the task's demands, beating uniform similarity-based retrieval, and it grounds this explicitly in cognitive-fit theory, the idea that the right representation depends on the job (Can routing queries to task-matched structures improve RAG reasoning?). In recommendation, VQ-Rec deliberately *breaks* the tight coupling between text similarity and outcomes by quantizing text into discrete codes that index learned embeddings, precisely to escape text-similarity bias (Can discretizing text embeddings improve recommendation transfer?). Both are saying: don't let raw semantic proximity be the final arbiter of relevance.
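The "task-aware decision between the query and the lookup" can be sketched roughly as below. This is an illustration only: StructRAG is described as using a DPO-trained router, whereas this sketch swaps in hand-written keyword rules so that it runs without a model. `ROUTING_RULES` and `route` are hypothetical names, and the cue phrases are assumptions, not part of the published system.

```python
# Structure types follow the ones named in the text; the cue phrases are
# hypothetical stand-ins for what a trained router would learn.
ROUTING_RULES = [
    ("table", ("compare", "versus", "how many")),
    ("graph", ("related to", "connected", "depends on")),
    ("algorithm", ("steps", "how do i", "procedure")),
    ("catalogue", ("list", "overview", "summarize")),
]

def route(query: str) -> str:
    """Pick a knowledge structure for the query before any similarity lookup runs."""
    lowered = query.lower()
    for structure, cues in ROUTING_RULES:
        if any(cue in lowered for cue in cues):
            return structure
    # Fall back to plain chunk retrieval when no task cue is found.
    return "chunk"

for q in (
    "Compare the 2022 and 2023 revenue figures",
    "Which services does the billing API depend on? What depends on it?",
    "What is the refund policy?",
):
    print(f"{route(q):>9}  <-  {q}")
```

The design point is the ordering rather than the rules themselves: the routing decision is made from the task the query implies, and only afterwards does any retrieval (similarity-based or otherwise) run inside the chosen structure.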
A second cluster goes further and substitutes natural-language *descriptions* for vector distance entirely. SignRAG recognizes unknown images by describing them in text and retrieving from a text-indexed database, finding that description bridges the reference gap better than direct embedding similarity (Can describing images in text improve zero-shot recognition?). PLUS finds that learned text summaries of a user's preferences condition reward models more effectively, and more interpretably, than preference embeddings (Can text summaries beat embeddings for personalized reward models?). The pattern: when relevance is what you care about, an explicit, inspectable text representation often beats an opaque distance in vector space.
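A rough sketch of the describe-then-retrieve pattern, under heavy assumptions: the description is hard-coded where a real system would call a vision-language model, the reference index is a hypothetical three-entry dictionary, and retrieval uses simple token overlap (Jaccard) as a stand-in for whatever text retriever a system like SignRAG actually uses.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Token-set overlap: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical text-indexed reference database: label -> textual description.
REFERENCE_INDEX = {
    "stop": "red octagon with white text",
    "yield": "white inverted triangle with red border",
    "speed_limit": "white rectangle with black numbers",
}

def retrieve(description, index, k=1):
    """Rank reference entries by text overlap with the description."""
    query = tokenize(description)
    ranked = sorted(
        index,
        key=lambda label: jaccard(query, tokenize(index[label])),
        reverse=True,
    )
    return ranked[:k]

# Assumption: in a real pipeline this string would come from a
# vision-language model describing the unknown image.
description = "a red octagon sign with white lettering"

top_matches = retrieve(description, REFERENCE_INDEX, k=2)
print(top_matches)
```

Because both the query and the index are plain text, every step is inspectable: it is possible to read the description, see which words matched, and correct either side by editing a sentence rather than retraining an encoder.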
There's a deeper reason to be skeptical that similarity ever cleanly equals relevance, which the corpus surfaces almost as a warning. Models lean on statistical surface mass rather than meaning: they systematically prefer high-frequency paraphrases over equivalent rare ones (Do language models really understand meaning or just surface frequency?), and instruction tuning turns out to teach output *format* far more than task *understanding* (Does instruction tuning teach task understanding or output format?). If the underlying representations are tracking frequency and form, then 'similarity' inherits those biases, and dialing it up won't recover relevance.
The thing you might not have known you wanted to know: the field hasn't really tried to make embeddings *themselves* measure relevance. The winning strategies leave the embedding doing what it's good at — cheap semantic association — and bolt a task-aware router, a decoupling layer, or an explicit text intermediary on top to supply the relevance the vectors can't. Relevance, in this corpus, is an architecture decision, not a distance metric.
Sources (9 notes)
Embeddings encode co-occurrence patterns, making semantically close but role-distinct concepts highly similar. This works in simple demos but fails in production where underspecified queries have many wrong-but-associated candidates.
Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.
Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.
StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.
VQ-Rec uses product quantization to map item text to discrete codes that index learned embeddings, breaking the tight coupling between text and recommendations. This decoupling prevents text-similarity bias and allows lookup tables to adapt to new domains without retraining the text encoder.
SignRAG demonstrates that describing an unknown image with a vision-language model, then retrieving known designs from a text-indexed database, eliminates the need to train a recognition model. Natural-language description bridges the visual-reference gap better than direct embedding similarity.
PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.
LLMs show a consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests that models rely primarily on statistical mass from pretraining rather than on recognizing meaning.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, with reported scores of 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.