Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words

Paper · arXiv 2205.05092 · Published May 10, 2022

We uncover systematic ways in which word similarities estimated by cosine over BERT embeddings are understated, and trace this effect to training data frequency. We find that, relative to human judgements, cosine similarity underestimates the similarity of frequent words with other instances of the same word or with other words across contexts, even after controlling for polysemy and other factors. We conjecture that this underestimation of similarity for high frequency words is due to differences in the representational geometry of high and low frequency words, and provide a formal argument for the two-dimensional case.

Our work asks: how does frequency impact the measured semantic similarity of high frequency words?

We conjecture that word frequency induces such distortions via differences in the representational geometry. We introduce new methods for characterizing geometric properties of a word’s representation in contextual embedding space, and offer a formal argument for why differences in representational geometry affect cosine similarity measurement in the two-dimensional case.
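For reference, the cosine measure under discussion is simply the normalized dot product between two embedding vectors. A minimal sketch (the vectors here are illustrative, not real BERT embeddings):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors:
    dot(u, v) / (||u|| * ||v||)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two hypothetical contextual embeddings of the same word in different contexts
u = np.array([0.9, 0.1, 0.3])
v = np.array([0.8, 0.2, 0.4])
sim = cosine_similarity(u, v)
```

Because cosine depends only on direction, any frequency-driven distortion of where a word's contextual embeddings sit geometrically translates directly into distorted similarity scores.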

The task presents pairs of words in context, labeled as having the same or different meaning:

• same meaning: “I try to avoid the company of gamblers” and “We avoided the ball”.

• different meaning: “You must carry your camping gear” and “Sound carries well over water”.

Examining the errors as a function of frequency reveals that cosine similarity is a less reliable predictor of human similarity judgements for common terms. Figure 2 shows the average proportion of examples predicted to be the same meaning as a function of frequency, grouped into ten bins, each with the same number of examples.
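The equal-sized frequency binning behind this analysis can be sketched as follows; the helper name and inputs are illustrative, not the paper's code:

```python
import numpy as np

def proportion_same_by_frequency(frequencies, predicted_same, n_bins=10):
    """Sort examples by word frequency, split them into n_bins bins with
    (roughly) equal numbers of examples, and return the proportion of
    examples predicted 'same meaning' in each bin."""
    order = np.argsort(frequencies)              # indices from rarest to most frequent
    bins = np.array_split(order, n_bins)         # equal-count frequency bins
    preds = np.asarray(predicted_same, dtype=float)
    return [float(np.mean(preds[idx])) for idx in bins]
```

A downward trend across the higher-frequency bins would reflect the underestimation effect described above.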

While the full details of our Validator are proprietary, we give an example here to illustrate the general idea. The HubSpot API has two kinds of retrieval queries: retrieving properties, which can be done in a single API call, and retrieving associations, which require two API calls. For example, a deal’s amount and closing date are properties, but notes are separate objects associated with the deal. LLMs frequently confuse these two operations, wrongly treating associations as properties. Our Validator identifies this error and provides the following feedback to the LLM: “Hubspot needs you to search for associated resource first and use its deal id as the associated resource id in your second query. Break into two steps and do variable injection”. If the LLM makes the same error again, the Validator will keep repeating its feedback until the error is fixed; hence the loop in Figure 2.
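Since the actual Validator is proprietary, the feedback loop can only be sketched. In the sketch below, all names, the set of association types, the plan format, and the retry limit are illustrative assumptions:

```python
# Hypothetical static validator for the property-vs-association confusion.
ASSOCIATION_TYPES = {"notes", "tasks"}  # assumed association resource types

def validate_plan(plan):
    """Return feedback text if the plan fetches an associated resource as if
    it were a property (a single-call retrieval), else None."""
    for step in plan:
        if step["resource"] in ASSOCIATION_TYPES and step.get("single_call", True):
            return ("Search for the associated resource first and use its deal "
                    "id in your second query. Break into two steps.")
    return None

def run_with_validator(llm_generate, max_rounds=5):
    """Loop: ask the LLM for a plan, validate it, and feed any error message
    back until the validator is satisfied or the retry limit is reached."""
    feedback = None
    plan = None
    for _ in range(max_rounds):
        plan = llm_generate(feedback)   # LLM proposes a call plan
        feedback = validate_plan(plan)
        if feedback is None:            # validator satisfied: stop looping
            return plan
    return plan                          # give up after max_rounds
```

Because the check is static code rather than another LLM, each iteration of the loop adds essentially no latency or cost.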

3.3.2 Relation to other approaches

We contrast our approach with another set of approaches called “Agent-Critic Systems” or “Agentic Workflows”. In these systems, it is common to have a primary LLM “agent” receive feedback from other LLMs that act as “critics” ([6, 7, 8]). These additional LLMs not only increase time and cost, but also suffer from high error rates, defeating the very purpose of a critic. In other words, we do not believe LLMs are broadly capable of self-critique, as exemplified by the limited accuracy of agentic systems on benchmarks like SWE-Bench as well as on real-world tasks ([9, 10]).

Our approach differs from these in that we employ an entirely static critic. Static critics require domain knowledge and some engineering effort to develop, but they have the advantage of being extremely fast and perfectly accurate. The use of static critics is possible because the errors made by LLMs in function calling tend to be highly repetitive and predictable. Our approach bears some resemblance to the LLM-modulo approaches proposed in [11]. However, rather than relying on external verifiers that must cover every possible instance, we develop our own verifiers targeting the most common errors committed by the LLM agent. By focusing on these common error patterns, we can achieve significant improvements in accuracy and reliability.