How should practitioners measure similarity between embeddings safely?

This reads the question as: when you compute a similarity score between two embeddings, what can quietly go wrong, and what does the corpus suggest about doing it responsibly?

The practical hazard is treating an embedding similarity score as if it measures what you think it measures, and the corpus is unusually pointed on the subject. The first thing to absorb is that the most common tool, cosine similarity, can be arbitrary. A closed-form analysis of regularized linear models shows that cosine scores between learned embeddings are not unique: they shift with regularization choices made during training, not with any underlying semantic structure (Does cosine similarity actually measure embedding similarity?). So a cosine score isn't a stable physical quantity you read off a ruler; it is partly an artifact of how the model was fit. 'Safely' starts with not over-trusting the number.
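One way to see the non-uniqueness concretely is the rescaling argument behind it. In a factorization model whose predictions are the products of two embedding matrices, rescaling each latent dimension in one matrix and undoing it in the other leaves every prediction unchanged but moves the cosine scores. The sketch below is a minimal illustration under that assumption; the matrices `A` and `B`, the scaling `D`, and the helper `cosine_matrix` are hypothetical names and random data, not taken from the cited analysis.

```python
import numpy as np

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / norms
    return Xn @ Xn.T

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # hypothetical "user" embeddings
B = rng.normal(size=(5, 3))   # hypothetical "item" embeddings

# Arbitrary per-dimension rescaling (an assumption for illustration).
D = np.diag([1.0, 5.0, 0.2])
A2 = A @ D
B2 = B @ np.linalg.inv(D)

# The model's predictions (dot products) are identical...
same_predictions = np.allclose(A @ B.T, A2 @ B2.T)

# ...but item-item cosine similarities are not.
max_cosine_shift = float(np.abs(cosine_matrix(B) - cosine_matrix(B2)).max())

print("predictions unchanged:", same_predictions)
print("largest change in an item-item cosine:", round(max_cosine_shift, 3))
```

Both embedding sets fit the data equally well, so nothing in the training objective prefers one set of cosine scores over the other. Which one you get depends on how regularization happens to pin down the scaling.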

The second hazard is about what the geometry encodes even when it's stable. Embeddings capture semantic association (co-occurrence patterns), not task relevance. Two items that are 'close' may be close because they appear in similar contexts while being exactly the wrong answer for your actual query (Do vector embeddings actually measure task relevance?). This is why a similarity demo dazzles and the production system disappoints: real queries are underspecified and full of plausible-but-wrong neighbors. A safe practitioner asks 'similar in what sense?' before shipping the score as relevance.

There's also a structural ceiling worth knowing. Communication-complexity theory proves that for any fixed embedding dimension there is a hard limit on how many distinct top-k document combinations the geometry can ever return. The limit holds even for embeddings optimized directly on the test set, and it shows up on trivially simple tasks (Do embedding dimensions fundamentally limit retrievable document combinations?). So some retrieval failures aren't tuning problems; they're baked into the dimensionality. No similarity metric rescues you from that.
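The ceiling is easiest to feel in the most extreme case. The toy below is not the construction used in the cited proof; it only assumes dot-product scoring with one-dimensional embeddings, where a query can do nothing except pick a direction along the line. The function name `reachable_top_k_sets` and the six example documents are hypothetical.

```python
from math import comb
import numpy as np

def reachable_top_k_sets(docs, k, n_queries=2000, seed=0):
    """Distinct top-k sets returned by randomly sampled queries under dot-product scoring."""
    rng = np.random.default_rng(seed)
    queries = rng.normal(size=(n_queries, docs.shape[1]))
    scores = queries @ docs.T                     # shape (n_queries, n_docs)
    top = np.argsort(-scores, axis=1)[:, :k]      # indices of the k best documents per query
    return {frozenset(row.tolist()) for row in top}

n_docs, k = 6, 2
docs_1d = np.arange(1.0, n_docs + 1).reshape(-1, 1)   # six hypothetical documents on a line

found_1d = reachable_top_k_sets(docs_1d, k)
total_sets = comb(n_docs, k)

print(f"{len(found_1d)} of {total_sets} possible top-{k} sets are reachable in 1-D")
```

With one dimension, a positive query returns the two largest documents and a negative query the two smallest, so only 2 of the 15 possible pairs can ever be a result. Adding dimensions raises the count, but the cited result says it stays bounded for any fixed dimension.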

Where the corpus gets constructive is in what to reach for instead, or alongside. Counterintuitively, the safe similarity function is often the simple one: carefully tuned dot products beat learned MLP 'similarity' networks despite the MLP's theoretical universality, because the dot product's inductive bias matters more than raw expressiveness. Dot products also retrieve efficiently at scale via maximum inner product search (MIPS), which learned similarities cannot support (Why does dot product beat MLP-based similarity in practice?; Can MLPs learn to match dot product similarity in practice?). Complexity is a liability here, not a safeguard. And when the question is relational rather than associative (multi-hop, aggregate, 'which items connect through X'), graph traversal gives deterministic, complete answers where probabilistic vector search silently drops results (When do graph databases outperform vector embeddings for retrieval?).
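To make the retrieval side concrete, here is a minimal sketch of exact maximum inner product search by brute force. It assumes the item embeddings fit in memory as one NumPy array; production systems typically swap in an approximate index, which is not shown. The function name `top_k_inner_product` and the toy vectors are hypothetical.

```python
import numpy as np

def top_k_inner_product(query, items, k):
    """Return indices of the k items with the largest dot product with `query`, best first."""
    scores = items @ query                          # one matrix-vector product scores every item
    candidates = np.argpartition(-scores, k - 1)[:k]  # unordered top-k, O(n) selection
    return candidates[np.argsort(-scores[candidates])]

# Toy item embeddings (hypothetical values chosen so that scores do not tie).
items = np.array([
    [1.0, 0.0],
    [0.0, 3.0],
    [2.0, 2.0],
    [-1.0, 0.0],
])
query = np.array([1.0, 1.0])

best_two = top_k_inner_product(query, items, k=2)
print(best_two.tolist())
```

Because the score is a plain dot product, the whole catalog is scored in a single matrix-vector product. A learned MLP similarity would need a forward pass per query-item pair, which is the efficiency gap the paragraph above refers to.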

The most interesting move is to route around raw embedding comparison entirely. Describing an unknown image in natural language and then retrieving over a text index outperforms direct embedding similarity for zero-shot recognition, because language bridges the reference gap more faithfully than vector proximity (Can describing images in text improve zero-shot recognition?). Similarly, mapping text to discrete codes deliberately decouples representation from text-similarity bias, so a recommender stops inheriting the quirks of the text encoder (Can discretizing text embeddings improve recommendation transfer?). The throughline: measuring embedding similarity 'safely' means knowing the score is a fitted artifact, distinguishing association from relevance, respecting the dimensional ceiling, preferring simple well-understood metrics, and being willing to swap the whole comparison for language or graph structure when proximity is the wrong instrument.
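The discrete-code idea mentioned above can be sketched with generic product quantization: split each embedding into sub-vectors and replace each sub-vector with the index of its nearest centroid. This is a hedged illustration of the general technique, not the cited system's implementation. The function `pq_encode` is a hypothetical name, and the codebooks are hand-written here, whereas in practice they would be learned (for example with k-means).

```python
import numpy as np

def pq_encode(vectors, codebooks):
    """Map each vector to one discrete code per sub-space.

    vectors:   shape (n, m * sub_dim)
    codebooks: shape (m, n_codes, sub_dim), one set of centroids per sub-space
    returns:   integer codes of shape (n, m)
    """
    m, n_codes, sub_dim = codebooks.shape
    parts = vectors.reshape(len(vectors), m, sub_dim)
    # Squared distance from every sub-vector to every centroid in its sub-space.
    diffs = parts[:, :, None, :] - codebooks[None, :, :, :]
    dists = (diffs ** 2).sum(axis=-1)               # shape (n, m, n_codes)
    return dists.argmin(axis=-1)

# Two sub-spaces, two centroids each (illustrative values, not learned).
codebooks = np.array([
    [[0.0, 0.0], [10.0, 10.0]],
    [[0.0, 0.0], [10.0, 10.0]],
])
text_embedding = np.array([[1.0, 1.0, 9.0, 9.0]])   # hypothetical item text embedding

example_codes = pq_encode(text_embedding, codebooks)
print(example_codes.tolist())
```

Downstream components then see only the integer codes, which can index a separate lookup table of learned embeddings. That is the sense in which the representation is decoupled from the text encoder's own similarity structure.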

Sources 8 notes

Does cosine similarity actually measure embedding similarity?

Regularized linear models with closed-form solutions show that cosine similarities between embeddings are not unique and depend on regularization choices made during training, not on actual semantic structure. This makes cosine scores unstable and potentially meaningless.

Do vector embeddings actually measure task relevance?

Embeddings encode co-occurrence patterns, making semantically close but role-distinct concepts highly similar. This works in simple demos but fails in production where underspecified queries have many wrong-but-associated candidates.

Do embedding dimensions fundamentally limit retrievable document combinations?

Communication complexity theory proves that for any embedding dimension d, there exists a maximum number of top-k document combinations that can be returned as results. Even embeddings optimized directly on test data hit this polynomial limit, demonstrated on trivially simple retrieval tasks.

Why does dot product beat MLP-based similarity in practice?

Rendle et al. show properly-tuned dot products substantially beat MLP-based similarity despite MLP universality. Learning a dot product with an MLP requires large models and datasets; dot products also enable efficient retrieval at production scale through MIPS algorithms.

Can MLPs learn to match dot product similarity in practice?

Rendle et al. show that carefully tuned dot products substantially outperform learned MLP similarities in collaborative filtering. MLPs require excessive capacity and data to match simple geometric similarity, and MLP scores do not support efficient retrieval at scale, which suggests that inductive bias matters more than expressiveness.

When do graph databases outperform vector embeddings for retrieval?

Graph-oriented databases solve vector similarity's failure on aggregate queries by replacing probabilistic similarity search with deterministic graph traversal via Cypher. The tradeoff: higher construction cost but precision and completeness for enterprise use cases where query patterns are relational.

Can describing images in text improve zero-shot recognition?

SignRAG demonstrates that describing an unknown image via a vision-language model, then retrieving known designs from a text-indexed database, eliminates the need for recognition model training. Natural-language description bridges the visual-reference gap better than direct embedding similarity.

Can discretizing text embeddings improve recommendation transfer?

VQ-Rec uses product quantization to map item text to discrete codes that index learned embeddings, breaking the tight coupling between text and recommendations. This decoupling prevents text-similarity bias and allows lookup tables to adapt to new domains without retraining the text encoder.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a retrieval systems researcher evaluating whether embedding similarity remains a bottleneck or liability in 2025+. The question is: How should practitioners measure similarity between embeddings safely?

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026 and include:
- Cosine similarity scores between learned embeddings are not unique; they are artifacts that shift with regularization choices, not semantic structure (2024-03, arXiv:2403.05440).
- Embeddings encode semantic association (co-occurrence), not task relevance, causing production retrieval to miss task-aligned neighbors while returning plausible-but-wrong hits (~2023–2024).
- Hard mathematical ceiling: any fixed embedding dimension limits the count of distinct top-k result sets, even for test-optimized embeddings. This is a dimensionality ceiling, not a tuning problem (2025-08, arXiv:2508.21038).
- Dot-product similarity outperforms learned MLP 'similarity' networks in practice despite MLP's theoretical universality; simplicity and MIPS efficiency matter more than expressiveness (2022–2023).
- Alternative pathways (language description + text retrieval, discrete codes, graph traversal, RLHF-grounded preference measurement) often outflank raw embedding similarity (2024–2026).

Anchor papers (verify; mind their dates):
- arXiv:2403.05440 (2024-03): Is Cosine-Similarity of Embeddings Really About Similarity?
- arXiv:2508.21038 (2025-08): On the Theoretical Limitations of Embedding-Based Retrieval
- arXiv:2604.03238 (2026-01): Measuring Human Preferences in RLHF is a Social Science Problem
- arXiv:2605.23821 (2026-05): Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, assess whether post-2024 scaling, fine-tuning methods (LoRA, DPO, synthetic preference data), embedding quantization, or retrieval orchestration (reranking, hybrid BM25+dense, multi-tower architectures) have relaxed the arbitrary-cosine problem or the dimensionality ceiling. Separate durable insights (e.g., 'embeddings encode association, not relevance') from resolved limitations (e.g., 'dot-product MIPS is the only efficient option'). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — particularly any showing learned similarity functions ARE competitive at scale, or that geometric ceilings are circumvented by recent representational techniques.
(3) Propose 2 research questions that assume the regime may have shifted: one about whether foundation-model embeddings (GPT-4, Claude, Gemini-level) escape the association-vs-relevance split; one about whether synthetic preference data or RLHF retraining now lets practitioners override the dimensionality ceiling.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
