Why do embeddings measure association instead of actual task relevance?

This explores why text embeddings cluster things by what tends to appear together ('association') rather than by what actually matters for the job you're using them for ('task relevance') — and what the corpus says about why that gap exists and how to close it.

The short version from the corpus: embeddings are built from co-occurrence statistics, and co-occurrence is not the same thing as usefulness. An embedding learns that two words or items sit near each other in text, so concepts that are *semantically* close but play *different roles* end up looking nearly identical (Do vector embeddings actually measure task relevance?). That's fine in a clean demo, but in production an underspecified query surfaces a crowd of wrong-but-associated candidates. The embedding can't tell the merely related from the actually relevant, because relatedness is the only signal it ever encoded.
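To make the co-occurrence point concrete, here is a minimal sketch that builds count-based context vectors from a tiny corpus. The corpus, the window size, and the function names are all invented for illustration; real embedding models are far more elaborate, but the same pressure applies.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpus: "increase" and "decrease" play opposite roles
# but show up in exactly the same contexts.
CORPUS = [
    "please increase the timeout value",
    "please decrease the timeout value",
    "we should increase the batch size",
    "we should decrease the batch size",
    "the cat sat on the mat",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` tokens."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


VECTORS = cooccurrence_vectors(CORPUS)
sim_opposites = cosine(VECTORS["increase"], VECTORS["decrease"])
sim_unrelated = cosine(VECTORS["increase"], VECTORS["cat"])
print(f"increase vs decrease: {sim_opposites:.3f}")
print(f"increase vs cat:      {sim_unrelated:.3f}")
```

In this toy corpus the two opposite verbs share every context, so their vectors are indistinguishable, while an unrelated word scores much lower. A query that needs "increase" gets "decrease" back as a top match, which is exactly the wrong-but-associated candidate described above.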

There's a deeper, almost unsettling result underneath this: even the *similarity score* you read off embeddings may not mean what you think. Work on cosine similarity shows that for regularized linear models the cosine values aren't unique: they shift with the regularization choices made during training rather than tracking any real semantic structure (Does cosine similarity actually measure embedding similarity?). So the gap isn't only 'association ≠ relevance'; it's that the geometry you're measuring association *with* can itself be an artifact of how the model was fit.
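One way such non-uniqueness can arise is sketched below, under an assumption the note itself does not spell out: a factorization-style model whose predictions depend only on the dot product of an item vector and a user vector. Rescaling one latent dimension up on the item side and down on the user side leaves every prediction unchanged, yet it moves the cosine between items. The factor values and scales are made up for illustration.

```python
from math import sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))


def rescale(rows, scales):
    """Multiply each latent dimension of every row by its own scale."""
    return [[x * s for x, s in zip(row, scales)] for row in rows]


def predictions(items, users):
    """Model output: one item-user dot product per pair."""
    return [[dot(i, u) for u in users] for i in items]


# Hypothetical 2-dimensional factors (assumed, not taken from any real model).
ITEMS = [[1.0, 1.0], [1.0, 0.2]]
USERS = [[0.5, 2.0], [1.5, 0.3]]

# Stretch dimension 2 for items and shrink it by the same factor for users.
SCALES = [1.0, 10.0]
ITEMS_RESCALED = rescale(ITEMS, SCALES)
USERS_RESCALED = rescale(USERS, [1.0 / s for s in SCALES])

PRED_BEFORE = predictions(ITEMS, USERS)
PRED_AFTER = predictions(ITEMS_RESCALED, USERS_RESCALED)
COS_BEFORE = cosine(ITEMS[0], ITEMS[1])
COS_AFTER = cosine(ITEMS_RESCALED[0], ITEMS_RESCALED[1])

print("predictions before:", PRED_BEFORE)
print("predictions after: ", PRED_AFTER)
print(f"item cosine before: {COS_BEFORE:.3f}, after: {COS_AFTER:.3f}")
```

Both parameterizations fit the data equally well, so which one training lands on is decided by something other than the data, such as the form of the regularizer. The cosine score inherits that arbitrariness.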

That said, embeddings aren't empty. Static transformer embeddings genuinely carry semantic content (valence, concreteness, even iconicity) before attention ever runs (Do transformer static embeddings actually encode semantic meaning?), and the leading directions of embedding space organize concepts coarse-to-fine in a way that mirrors WordNet's hierarchy (Do embedding eigenvectors organize taxonomy from coarse to fine?). The point isn't that embeddings know nothing. It's that what they know is taxonomic association, the shape of co-occurrence, not 'which of these answers your query.' The same statistical pull explains a sibling failure: models often ignore their own context because strong training-time associations override the information in front of them (Why do language models ignore information in their context?), and you can even predict that priming from a keyword's pre-training probability (Can we predict keyword priming before learning happens?). Association is the gravity well; relevance is something you have to add on top.

The interesting move in the corpus is how people route *around* raw embedding similarity. Recommenders quantize item text into discrete codes so the recommendation stops inheriting text-similarity bias and can adapt per domain (Can discretizing text embeddings improve recommendation transfer?). Personalization gets better when you replace preference *vectors* with text *summaries* that capture dimensions embeddings miss and that stay human-readable (Can text summaries beat embeddings for personalized reward models?). Zero-shot vision retrieval works better when you describe an image in natural language and retrieve against text than when you match embeddings directly (Can describing images in text improve zero-shot recognition?). The common thread: each adds an explicit, task-shaped representation between the embedding and the decision, instead of trusting proximity to stand in for relevance.
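As a rough illustration of the first of these workarounds, the sketch below quantizes a text embedding into discrete codes and then builds the item representation from a separate lookup table indexed by those codes. It is not the VQ-Rec implementation: the codebooks, lookup tables, and function names are hypothetical, and summing the looked-up vectors is an assumed pooling choice. In a real system the codebooks would be fit to data and the tables trained per domain.

```python
def nearest_centroid(sub_vector, centroids):
    """Index of the centroid closest to sub_vector (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(sub_vector, centroid))
    return min(range(len(centroids)), key=lambda j: sq_dist(centroids[j]))


def pq_encode(vector, codebooks):
    """Split vector into len(codebooks) equal chunks and quantize each chunk."""
    size = len(vector) // len(codebooks)
    return tuple(
        nearest_centroid(vector[m * size:(m + 1) * size], centroids)
        for m, centroids in enumerate(codebooks)
    )


def item_representation(codes, lookup_tables):
    """Sum the learned vectors that the codes index, one per subspace."""
    picked = [lookup_tables[m][code] for m, code in enumerate(codes)]
    return [sum(column) for column in zip(*picked)]


# Hypothetical codebooks: 2 subspaces, 2 centroids each, 2 dimensions per chunk.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]
# Hypothetical per-domain lookup tables: one learned vector per (subspace, code).
LOOKUP_TABLES = [
    [[0.1, 0.0, 0.3], [0.7, 0.2, 0.0]],
    [[0.0, 0.5, 0.1], [0.4, 0.4, 0.4]],
]

TEXT_EMBEDDING_A = [0.9, 0.1, 0.2, 0.8]
TEXT_EMBEDDING_B = [0.8, 0.2, 0.1, 0.9]

CODES_A = pq_encode(TEXT_EMBEDDING_A, CODEBOOKS)
CODES_B = pq_encode(TEXT_EMBEDDING_B, CODEBOOKS)
ITEM_VECTOR_A = item_representation(CODES_A, LOOKUP_TABLES)
print(CODES_A, CODES_B, ITEM_VECTOR_A)
```

The design point is the indirection: the recommender only ever sees the codes, so what two items mean to each other is set by the lookup tables rather than by raw text proximity, and swapping in a new table adapts the representation without touching the text encoder.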

The thing you didn't know you wanted to know: this is the same problem that shows up in human annotation. Labels decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, and treating them as one uniform signal quietly contaminates whatever you train on them (Do all annotation responses measure the same underlying thing?). Whether it's a cosine score or a crowd-worker's click, the failure mode is identical: a single number gets read as 'relevance' when it's actually a blend of associations measuring several different things at once.

Sources 10 notes

Do vector embeddings actually measure task relevance?

Embeddings encode co-occurrence patterns, making semantically close but role-distinct concepts highly similar. This works in simple demos but fails in production where underspecified queries have many wrong-but-associated candidates.

Does cosine similarity actually measure embedding similarity?

Regularized linear models with closed-form solutions show that cosine similarities between embeddings are not unique and depend on regularization choices made during training, not on actual semantic structure. This makes cosine scores unstable and potentially meaningless.

Do transformer static embeddings actually encode semantic meaning?

Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.

Do embedding eigenvectors organize taxonomy from coarse to fine?

Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Can discretizing text embeddings improve recommendation transfer?

VQ-Rec uses product quantization to map item text to discrete codes that index learned embeddings, breaking the tight coupling between text and recommendations. This decoupling prevents text-similarity bias and allows lookup tables to adapt to new domains without retraining the text encoder.

Can text summaries beat embeddings for personalized reward models?

PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.

Can describing images in text improve zero-shot recognition?

SignRAG demonstrates that describing an unknown image via vision-language model, then retrieving known designs from a text-indexed database, eliminates the need for recognition model training. Natural-language description bridges the visual-reference gap better than direct embedding similarity.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
