How should practitioners measure similarity between embeddings safely?
This reads the question as: when you compute a similarity score between two embeddings, what can quietly go wrong, and what does the corpus suggest about doing it responsibly?
The corpus is unusually pointed about the practical hazards of treating an embedding similarity score as if it measures what you think it measures. The first thing to absorb is that the most common tool, cosine similarity, can be arbitrary. A closed-form analysis of regularized linear models shows that cosine scores between learned embeddings are not unique: they shift with regularization choices made during training, not with any underlying semantic structure (see: Does cosine similarity actually measure embedding similarity?). So a cosine score isn't a stable physical quantity you read off a ruler; it is partly an artifact of how the model was fit. 'Safely' starts with not over-trusting the number.
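A minimal sketch of that non-uniqueness, assuming a factorization-style model in which only the product of two embedding matrices is fit to data. The matrices `A` and `B`, the rescaling vector `d`, and the helper `cosine_matrix` are illustrative names, not taken from the cited analysis. Rescaling the latent dimensions leaves every prediction unchanged while changing the cosine similarities between item embeddings, so which cosine values you see depends on which of these equivalent solutions training happens to select.

```python
import numpy as np

def cosine_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))   # hypothetical user embeddings
B = rng.normal(size=(4, 3))   # hypothetical item embeddings

# Rescale each latent dimension by d on one side and 1/d on the other.
# The product A @ B.T (the model's predictions) is unchanged.
d = np.array([0.1, 1.0, 10.0])
A2 = A * d
B2 = B / d

predictions_match = np.allclose(A @ B.T, A2 @ B2.T)
cosines_match = np.allclose(cosine_matrix(B), cosine_matrix(B2))
print("same predictions:", predictions_match)
print("same item-item cosines:", cosines_match)
```

Both parameterizations fit the data equally well, so nothing in the predictions tells you which set of cosine scores is the "right" one.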
The second hazard is about what the geometry encodes even when it's stable. Embeddings capture semantic association (co-occurrence patterns), not task relevance. Two items that are 'close' may be close only because they appear in similar contexts, while being exactly the wrong answer for your actual query (see: Do vector embeddings actually measure task relevance?). This is why a similarity demo dazzles and the production system disappoints: real queries are underspecified and surrounded by plausible-but-wrong neighbors. A safe practitioner asks 'similar in what sense?' before shipping the score as relevance.
There's also a structural ceiling worth knowing. Communication-complexity theory proves that for any fixed embedding dimension there is a hard limit on how many distinct top-k document combinations the geometry can ever return. The limit holds even for embeddings optimized directly on the test set, and it has been demonstrated on trivially simple tasks (see: Do embedding dimensions fundamentally limit retrievable document combinations?). So some retrieval failures aren't tuning problems; they're baked into the dimensionality. No similarity metric rescues you from that.
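A toy illustration of the ceiling, not the proof from the cited work. It assumes dot-product scoring, random document vectors, and randomly sampled queries; `reachable_topk_sets` is a hypothetical helper. In one dimension the ranking is fixed up to the sign of the query, so only two of the 15 possible top-2 sets among 6 documents can ever be returned. For higher dimensions the sampled count is only a lower bound on what is reachable, since random queries may miss some regions.

```python
import numpy as np
from math import comb

def reachable_topk_sets(docs, queries, k):
    """Distinct top-k document sets returned by dot-product scoring."""
    scores = queries @ docs.T                    # (num_queries, num_docs)
    top = np.argsort(-scores, axis=1)[:, :k]     # k highest-scoring docs per query
    return {frozenset(row.tolist()) for row in top}

rng = np.random.default_rng(0)
n_docs, k = 6, 2
possible = comb(n_docs, k)                       # all conceivable top-k sets

counts = {}
for dim in (1, 2, 3):
    docs = rng.normal(size=(n_docs, dim))
    queries = rng.normal(size=(20000, dim))      # sampled queries, not exhaustive
    counts[dim] = len(reachable_topk_sets(docs, queries, k))
    print(f"dim={dim}: {counts[dim]} of {possible} top-{k} sets reached")
```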
Where the corpus gets constructive is in what to reach for instead, or alongside. Counterintuitively, the safe similarity function is often the simple one: in collaborative filtering, carefully tuned dot products beat learned MLP 'similarity' networks despite the MLP's theoretical universality, because the dot product's inductive bias matters more than raw expressiveness. Dot products also retrieve efficiently at scale via maximum inner product search (MIPS), which learned similarities cannot (see: Why does dot product beat MLP-based similarity in practice?; Can MLPs learn to match dot product similarity in practice?). Complexity is a liability here, not a safeguard. And when the question is relational rather than associative (multi-hop, aggregate, 'which items connect through X'), graph traversal gives deterministic, complete answers where probabilistic vector search silently drops results (see: When do graph databases outperform vector embeddings for retrieval?).
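To make the retrieval point concrete, here is a sketch of exact maximum inner product search by brute force, assuming item embeddings fit in memory as a NumPy array. The function name `top_k_inner_product` and the random data are illustrative. Production systems would typically swap the brute-force scan for an approximate index, which this sketch does not attempt; the point is that scoring reduces to one matrix-vector product, which is what an MLP scorer gives up.

```python
import numpy as np

def top_k_inner_product(query, items, k):
    """Return indices and scores of the k items with the largest dot product."""
    scores = items @ query                              # one dot product per item
    candidates = np.argpartition(-scores, k - 1)[:k]    # top-k, in arbitrary order
    order = np.argsort(-scores[candidates])             # sort only those k
    top = candidates[order]
    return top, scores[top]

rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 32))   # hypothetical item embeddings
query = rng.normal(size=32)           # hypothetical query embedding

ids, top_scores = top_k_inner_product(query, items, k=5)
print(ids, top_scores)
```

Using `argpartition` before sorting keeps the cost linear in the number of items rather than paying for a full sort.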
The most interesting move is to route around raw embedding comparison entirely. Describing an unknown image in natural language and then retrieving over a text index outperforms direct embedding similarity for zero-shot recognition, because language bridges the reference gap more faithfully than vector proximity (see: Can describing images in text improve zero-shot recognition?). Similarly, mapping text to discrete codes deliberately decouples representation from text-similarity bias, so a recommender stops inheriting the quirks of the text encoder (see: Can discretizing text embeddings improve recommendation transfer?). The throughline: measuring embedding similarity 'safely' means knowing the score is a fitted artifact, distinguishing association from relevance, respecting the dimensional ceiling, preferring simple well-understood metrics, and being willing to swap the whole comparison for language or graph structure when proximity is the wrong instrument.
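The discrete-code idea can be sketched with plain product quantization. This is an illustration under stated assumptions, not the VQ-Rec implementation: the codebooks here are random rather than trained, the names `product_quantize`, `text_emb`, and `code_emb` are hypothetical, and summing the looked-up embeddings is one pooling choice among several. Each text embedding is split into sub-vectors, each sub-vector is replaced by the index of its nearest centroid, and the downstream model looks up its own embeddings by those indices instead of consuming the raw text vector.

```python
import numpy as np

def product_quantize(vectors, codebooks):
    """Map each vector to one discrete code per sub-vector (nearest centroid)."""
    n_sub, n_centroids, sub_dim = codebooks.shape
    parts = vectors.reshape(len(vectors), n_sub, sub_dim)
    # Squared distance from every sub-vector to every centroid in its codebook
    diffs = parts[:, :, None, :] - codebooks[None, :, :, :]
    dists = (diffs ** 2).sum(axis=-1)        # (n_vectors, n_sub, n_centroids)
    return dists.argmin(axis=-1)             # (n_vectors, n_sub) integer codes

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(10, 8))          # stand-in for text-encoder output
codebooks = rng.normal(size=(4, 16, 2))      # 4 sub-spaces, 16 centroids each (random, untrained)
codes = product_quantize(text_emb, codebooks)

# The recommender indexes its own tables by code, so the tables can be
# adapted without touching the text encoder.
code_emb = rng.normal(size=(4, 16, 6))       # hypothetical lookup tables
item_repr = code_emb[np.arange(4), codes].sum(axis=1)   # (10, 6)
print(codes[0], item_repr.shape)
```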
Sources (8 notes)
Regularized linear models with closed-form solutions show that cosine similarities between embeddings are not unique and depend on regularization choices made during training, not on actual semantic structure. This makes cosine scores unstable and potentially meaningless.
Embeddings encode co-occurrence patterns, making semantically close but role-distinct concepts highly similar. This works in simple demos but fails in production where underspecified queries have many wrong-but-associated candidates.
Communication complexity theory proves that for any embedding dimension d, there exists a maximum number of top-k document combinations that can be returned as results. Even embeddings optimized directly on test data hit this polynomial limit, demonstrated on trivially simple retrieval tasks.
Rendle et al. show properly-tuned dot products substantially beat MLP-based similarity despite MLP universality. Learning a dot product with an MLP requires large models and datasets; dot products also enable efficient retrieval at production scale through MIPS algorithms.
Rendle et al. show that carefully tuned dot products substantially outperform learned MLP similarities in collaborative filtering. MLPs require excessive capacity and data to match simple geometric similarity, and they do not support efficient retrieval at scale, which indicates that inductive bias matters more than expressiveness.
Graph-oriented databases solve vector similarity's failure on aggregate queries by replacing probabilistic similarity search with deterministic graph traversal via Cypher. The tradeoff: higher construction cost but precision and completeness for enterprise use cases where query patterns are relational.
SignRAG demonstrates that describing an unknown image via a vision-language model, then retrieving known designs from a text-indexed database, eliminates the need for recognition model training. Natural-language description bridges the visual-reference gap better than direct embedding similarity.
VQ-Rec uses product quantization to map item text to discrete codes that index learned embeddings, breaking the tight coupling between text and recommendations. This decoupling prevents text-similarity bias and allows lookup tables to adapt to new domains without retraining the text encoder.