Why do queries and documents occupy different embedding spaces?
Queries and documents express the same information in fundamentally different ways: short and interrogative versus long and declarative. This mismatch is central to why direct embedding retrieval often fails.
The standard embedding retrieval pipeline maps a query directly to a vector and finds nearby document vectors. This assumes that a query and a relevant document occupy nearby regions of the embedding space. They often do not. Queries are short, telegraphic, and interrogative. Relevant documents are long, detailed, and declarative. The same information expressed in query form and document form looks different to an encoder trained on natural language co-occurrence.
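To make the mismatch concrete, here is a minimal sketch of the standard pipeline, assuming a generic bi-encoder; the sentence-transformers model named below is an illustrative choice, not the paper's setup:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Standard dense retrieval: embed the query itself, then rank documents by
# cosine similarity. This asks the encoder to match a short interrogative
# string against long declarative passages.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative bi-encoder

def retrieve(query: str, docs: list[str], k: int = 5) -> list[tuple[float, str]]:
    q_vec = encoder.encode(query, normalize_embeddings=True)
    d_vecs = encoder.encode(docs, normalize_embeddings=True)
    scores = d_vecs @ q_vec  # cosine similarity (vectors are unit-normalized)
    top = np.argsort(-scores)[:k]
    return [(float(scores[i]), docs[i]) for i in top]
```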
HyDE (Hypothetical Document Embeddings) decomposes retrieval into two steps that exploit this asymmetry. First, ask an instruction-following LLM to generate a hypothetical document that would answer the query: not a real document, but something that looks like one. Second, embed the hypothetical document and use document-document similarity to find real corpus matches. The encoder, trained contrastively on document-document pairs, now operates in its natural space.
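Continuing the sketch above (reusing `encoder` and `np`), the HyDE step swaps the query embedding for the embedding of a generated answer. Here `generate` stands in for any instruction-following LLM call, and the prompt wording is illustrative, not the paper's exact template:

```python
def hyde_retrieve(query: str, docs: list[str], generate, k: int = 5):
    # Step 1: have the LLM write a document that *would* answer the query.
    prompt = f"Write a short passage that answers the question.\nQuestion: {query}\nPassage:"
    hypothetical = generate(prompt)  # may be factually wrong; only its form matters

    # Step 2: embed the hypothetical document and score real documents against it,
    # so the encoder compares document to document rather than query to document.
    h_vec = encoder.encode(hypothetical, normalize_embeddings=True)
    d_vecs = encoder.encode(docs, normalize_embeddings=True)
    scores = d_vecs @ h_vec
    top = np.argsort(-scores)[:k]
    return [(float(scores[i]), docs[i]) for i in top]
```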
The generated document may be factually wrong; it is, in the FLARE framing, a hallucination on purpose. But factual accuracy is not the goal: the relevance pattern is. The hypothetical document "captures relevance by example": it demonstrates what a relevant document looks like in style, terminology, and structure. The encoder's dense bottleneck filters out hallucinated details while preserving the embedding signature of relevant content.
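The HyDE paper strengthens this filtering by sampling several hypothetical documents and averaging their embeddings (the query's own embedding is folded into the average as well), so details unique to any one hallucinated sample wash out. A sketch of that pooling, again reusing `encoder`:

```python
def hyde_vector(query: str, generate, n_samples: int = 4):
    # Sample several hypothetical documents (temperature > 0 so they differ),
    # then average their embeddings with the query's. Idiosyncratic hallucinated
    # details cancel out; the shared relevance signature survives the average.
    prompt = f"Write a short passage that answers the question.\nQuestion: {query}\nPassage:"
    samples = [generate(prompt) for _ in range(n_samples)]
    vecs = encoder.encode(samples + [query], normalize_embeddings=True)
    return vecs.mean(axis=0)  # use this in place of the raw query embedding
```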
The implication is that the query is the wrong level of abstraction for retrieval. Queries work well when they are complete enough to uniquely identify relevant content — which is why they succeed on short-form factoid QA but fail on complex or underspecified queries. Hypothetical documents circumvent this by translating the query into the same genre as the targets.
The approach requires no relevance labels and no retrieval-specific fine-tuning — only an instruction-following LLM and an unsupervised contrastive encoder. On 11 query sets spanning web search, question answering, and fact verification, HyDE with InstructGPT and Contriever significantly outperforms the zero-shot no-relevance baseline.
Source: RAG
Related concepts in this collection
- Do language models actually build shared understanding in conversation?
  When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.
  Connection: the grounding gap in dialogue; HyDE is an example of building common ground in retrieval by generating an intermediate representation.
- Can prompt optimization teach models knowledge they lack?
  Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they're limited to activating existing training knowledge.
  Connection: HyDE works because the LLM already has enough knowledge to write a plausible answer; the generation activates a latent representation useful for retrieval.
Original note title
query-document vocabulary mismatch makes direct embedding retrieval suboptimal — hypothetical document bridging resolves it