Can semantic query expansion overcome vocabulary mismatch in corrupted text?
This explores whether broadening a search query with semantically related terms can recover the right documents when the source text itself is degraded — OCR errors, language drift, garbled tokens — so the query and the document no longer share surface vocabulary.
The corpus suggests a qualified yes: broadening a query with related terms can rescue retrieval from corrupted text, but only when expansion is paired with discipline about what the system is allowed to say afterward. The clearest case study is a multilingual RAG system built for noisy historical newspapers, where OCR garble and language drift mean the query word and the printed word rarely match exactly. Its solution is two-sided: aggressively expand retrieval to cast a wider net, then aggressively constrain generation so the model answers only when it has genuinely grounded evidence (see: Can RAG systems refuse to answer without reliable evidence?). Expansion buys coverage against vocabulary mismatch; grounded refusal stops that wider net from dragging in noise that the model then hallucinates over. The two moves are inseparable, because expansion alone would amplify the corruption.
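The expand-then-refuse pattern can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the newspaper system's implementation: the `RELATED` table, the fuzzy-match threshold, the `min_score` cutoff, and the sample documents are all hypothetical, and character-level fuzzy matching from the standard library stands in for whatever OCR-tolerant matching a real system would use.

```python
from difflib import SequenceMatcher

# Hypothetical related-term table; a real system might draw these from a
# thesaurus, a multilingual lexicon, or embedding neighbours.
RELATED = {"railway": ["railroad", "train"]}

def expand(query_terms):
    """Return the query terms plus any related terms from the table."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded.update(RELATED.get(term, []))
    return expanded

def fuzzy_hit(term, token, threshold=0.75):
    """Tolerate OCR-style character damage (assumed threshold)."""
    return SequenceMatcher(None, term, token).ratio() >= threshold

def score(doc, terms):
    """Count how many terms have at least one fuzzy match in the document."""
    tokens = doc.lower().split()
    return sum(1 for term in terms if any(fuzzy_hit(term, tok) for tok in tokens))

def answer_or_refuse(query_terms, docs, min_score=2, use_expansion=True):
    """Return the best-supported document, or None to signal a refusal."""
    terms = expand(query_terms) if use_expansion else set(query_terms)
    best = max(docs, key=lambda d: score(d, terms))
    if score(best, terms) < min_score:
        return None  # evidence too thin: refuse rather than guess
    return best

DOCS = [
    "the rai1road strike halted every tra1n",  # OCR-damaged text
    "weather report for tuesday",
]
```

With these toy documents, `answer_or_refuse(["railway", "strike"], DOCS)` returns the damaged railroad passage, while the same query with `use_expansion=False` falls below the evidence cutoff and returns `None`. An unrelated query such as `["harvest"]` is refused either way, which is the grounded-refusal half of the pattern.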
But there's a deeper reason expansion can't be the whole story, and it shows up when you ask what embeddings actually measure. Retrieval failures are architectural, not incremental: embeddings capture *association*, not *relevance*, and there are hard mathematical limits on how many distinct documents a given embedding dimension can even represent (see: Where do retrieval systems fail and why?). Semantic expansion works inside the embedding space, so it inherits that space's blind spots. The most pointed example is the gap between surface similarity and what you actually want: a student asks about 'projection' after a specific remark, but the semantically nearest passage is the one about projection *matrices*, which is the wrong one (see: Why do queries and their causes seem semantically different?). Expanding the query with more 'projection'-adjacent terms makes that confusion worse, not better, because the mismatch isn't lexical; it is a mismatch between cause and topic.
The adjacent lesson the corpus offers is that the heavy lifting against mismatch may belong at the *verification* stage rather than the expansion stage. A two-stage pipeline that first recalls broadly, then runs a small learned verifier over full token-to-token similarity maps, reliably rejects 'structural near-misses': documents that look right in compressed vector form but fall apart under close token-level inspection (see: Can verification separate structural near-misses from topical matches?). That's the same shape as grounded refusal: widen recall, then add a second, stricter gate. For corrupted text specifically, you can also attack the problem from the model side. Domain descriptions alone can be used to synthesize training data and adapt a retriever to a degraded or unusual corpus without ever touching the target collection (see: Can you adapt retrieval models without accessing target data?). That addresses mismatch by teaching the retriever the corrupted domain's vocabulary directly instead of patching each query.
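The recall-then-verify shape can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the three-dimensional `EMB` vectors are made up, a word-order swap is only one possible reading of "structural near-miss", and the hand-written `order_preserving` rule is a stand-in for the small learned Transformer verifier, whose behavior it does not reproduce. The sketch only shows why a token-to-token similarity map carries information that a pooled vector or a MaxSim sum throws away.

```python
import math

# Hypothetical toy token embeddings (one-hot, for readability).
EMB = {"dog": [1.0, 0.0, 0.0], "bites": [0.0, 1.0, 0.0], "man": [0.0, 0.0, 1.0]}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def embed(text):
    """One vector per whitespace token."""
    return [EMB[tok] for tok in text.split()]

def pooled(vectors):
    """Mean-pool token vectors into a single compressed vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def similarity_map(q, d):
    """Full query-token by document-token cosine matrix."""
    return [[cosine(qv, dv) for dv in d] for qv in q]

def maxsim(sim):
    """MaxSim-style late-interaction score: sum of per-row maxima."""
    return sum(max(row) for row in sim)

def order_preserving(sim, min_sim=0.9):
    """Stand-in verifier (assumption): each query token must have a strong
    match, and the matched positions must appear in query order."""
    best = [max(range(len(row)), key=row.__getitem__) for row in sim]
    strong = all(row[j] >= min_sim for row, j in zip(sim, best))
    return strong and best == sorted(best)

def retrieve(query, docs, recall_threshold=0.9):
    q = embed(query)
    # Stage 1: broad recall on pooled cosine.
    recalled = [d for d in docs
                if cosine(pooled(q), pooled(embed(d))) >= recall_threshold]
    # Stage 2: stricter gate on the token-to-token map.
    return [d for d in recalled if order_preserving(similarity_map(q, embed(d)))]
```

For the query `"dog bites man"`, both `"dog bites man"` and `"man bites dog"` pass the pooled-cosine stage and receive the same MaxSim score, but only the first survives the second gate, because the reversed document's best matches run against the query order in the similarity map.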
Two cautions are worth knowing. First, corruption isn't only an input problem you retrieve around: LLMs themselves silently corrupt about 25% of document content over long delegated workflows, with errors compounding rather than plateauing (see: Do frontier LLMs silently corrupt documents in long workflows?). So a pipeline that expands queries to handle noisy sources can quietly manufacture new noise downstream. Second, expansion presumes the system knows what it's looking for, but models are strikingly bad at recognizing when text supports multiple readings. GPT-4 disambiguates only 32% of genuinely ambiguous cases versus 90% for humans (see: Can language models recognize when text is deliberately ambiguous?). Corrupted text is ambiguous text, and a system that can't tell it's facing several plausible interpretations will expand confidently in one wrong direction. The takeaway you didn't come looking for: semantic expansion is real leverage against vocabulary mismatch, but every paper that makes it work pairs it with a downstream skeptic (a refusal rule, a verifier, an entailment check), because the same breadth that overcomes mismatch is also what lets corruption through.
Sources (7 notes)
A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.
RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.
Backtracing—finding what caused a query—diverges from semantic similarity especially in conversation and lecture domains. Students ask about projection after hearing a specific statement, but the semantically closest passage discusses projection matrices instead, showing that surface similarity misses the actual cause.
A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.
Research demonstrates that a brief textual domain description suffices to generate synthetic training data for retrieval fine-tuning, outperforming baselines in zero-target-access scenarios and enabling adaptation where conventional methods are blocked.
Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.