Knowledge Retrieval and RAG

Why do queries and documents occupy different embedding spaces?

Queries and documents express the same information in fundamentally different ways: short and interrogative versus long and declarative. Understanding this mismatch explains why direct embedding retrieval often fails.

Note · 2026-02-22 · sourced from RAG

The standard embedding retrieval pipeline maps a query directly to a vector and finds nearby document vectors. This assumes that a query and a relevant document occupy nearby regions of the embedding space. They often do not. Queries are short, telegraphic, and interrogative. Relevant documents are long, detailed, and declarative. The same information expressed in query form and document form looks different to an encoder trained on natural language co-occurrence.
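The direct pipeline above can be sketched as follows. The `embed` function here is a toy character-count stand-in for a real dense encoder such as Contriever (an illustrative assumption, not a real model); the rest is the standard rank-by-cosine loop:

```python
import math

def embed(text: str) -> list[float]:
    # Toy bag-of-characters "encoder" standing in for a real dense model.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    # Direct query -> vector retrieval: rank documents by similarity
    # to the query embedding itself.
    q = embed(query)
    ranked = sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]
```

The failure mode the text describes lives in `embed(query)`: a terse question and a verbose answer passage land in different regions of the space, so the cosine ranking degrades.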

HyDE (Hypothetical Document Embeddings) decomposes retrieval into two steps that exploit this asymmetry. First: ask an instruction-following LLM to generate a hypothetical document that would answer the query — not a real document, but something that looks like one. Second: embed the hypothetical document and use document-document similarity to find real corpus matches. The encoder, trained on document-to-document similarity, now operates in its natural space.
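A minimal sketch of the two steps, assuming an `llm` callable standing in for the instruction-following model and a toy character-count `embed` encoder (both hypothetical placeholders, not the paper's implementation):

```python
import math

def embed(text: str) -> list[float]:
    # Toy bag-of-characters encoder (stand-in for a real dense model).
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def hyde_retrieve(query: str, corpus: list[str], llm, k: int = 1) -> list[str]:
    # Step 1: prompt the LLM for a passage that would answer the query.
    # The passage may be factually wrong; only its style and vocabulary matter.
    hypo_doc = llm(f"Write a short passage answering: {query}")
    # Step 2: embed the *hypothetical document*, not the query, and rank
    # real documents by document-document similarity.
    h = embed(hypo_doc)
    return sorted(corpus, key=lambda d: cosine(h, embed(d)), reverse=True)[:k]
```

The only change from direct retrieval is which text gets embedded on the query side; the index over the corpus is untouched.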

The generated document may be factually wrong — it is, in the FLARE framing, a hallucination on purpose. But factual accuracy is not the goal; matching the relevance pattern is. The hypothetical document "captures relevance by example": it demonstrates what a relevant document looks like in style, terminology, and structure. The encoder's dense bottleneck filters out hallucinated details while preserving the embedding signature of relevant content.
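One concrete way this filtering is strengthened, as I read the HyDE paper's aggregation: sample several hypothetical documents and average their embeddings (together with the query embedding), so that details unique to any one hallucinated sample wash out while the shared relevance signal survives. A minimal sketch of the averaging step:

```python
def average_embedding(vectors: list[list[float]]) -> list[float]:
    # Element-wise mean of several embedding vectors, e.g. the embeddings
    # of N sampled hypothetical documents plus the query embedding.
    # Idiosyncratic (hallucinated) components present in only one sample
    # shrink by a factor of N; components shared across samples persist.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]
```

The averaged vector is then used for the same nearest-neighbor search as before.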

The implication is that the query is the wrong level of abstraction for retrieval. Queries work well when they are complete enough to uniquely identify relevant content — which is why they succeed on short-form factoid QA but fail on complex or underspecified queries. Hypothetical documents circumvent this by translating the query into the same genre as the targets.

The approach requires no relevance labels and no retrieval-specific fine-tuning — only an instruction-following LLM and an unsupervised contrastive encoder. On 11 query sets spanning web search, question answering, and fact verification, HyDE with InstructGPT and Contriever significantly outperforms the zero-shot no-relevance baseline.




query-document vocabulary mismatch makes direct embedding retrieval suboptimal — hypothetical document bridging resolves it