Can semantic query expansion overcome vocabulary mismatch in corrupted text?

This explores whether broadening a search query with semantically related terms can recover the right documents when the source text itself is degraded — OCR errors, language drift, garbled tokens — so the query and the document no longer share surface vocabulary.

The corpus suggests the answer is a qualified yes, but only when expansion is paired with discipline about what the system is allowed to say afterward. The clearest case study is a multilingual RAG system built for noisy historical newspapers, where OCR garble and language drift mean the query word and the printed word rarely match exactly. Its solution is two-sided: aggressively expand retrieval to cast a wider net, then aggressively constrain generation so the model answers only when it has genuinely grounded evidence (see "Can RAG systems refuse to answer without reliable evidence?"). Expansion buys coverage against vocabulary mismatch; grounded refusal stops that wider net from dragging in noise that the model then hallucinates over. The two moves are inseparable: expansion alone would amplify the corruption.
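The expand-then-refuse pattern can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the newspaper system's actual implementation: the `EXPANSIONS` table, the `min_score` threshold, the refusal string, and the sample documents are all hypothetical, and character-level fuzzy matching with the standard library's `difflib.SequenceMatcher` stands in for whatever matching a real system would use.

```python
from difflib import SequenceMatcher

# Hypothetical expansion table; a real system might derive related terms
# from embeddings, a thesaurus, or an LLM.
EXPANSIONS = {"railway": ["railroad", "train"]}
REFUSAL = "No reliable evidence found."  # hypothetical refusal message


def expand(query):
    """Return the query terms plus any related terms from the table."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(EXPANSIONS.get(term, []))
    return expanded


def score(terms, doc):
    """Best fuzzy match of any query term against any document token."""
    tokens = doc.lower().split()
    return max(
        (SequenceMatcher(None, t, tok).ratio() for t in terms for tok in tokens),
        default=0.0,
    )


def answer(query, docs, min_score=0.8):
    """Expand, retrieve the best document, and refuse if evidence is weak."""
    terms = expand(query)
    best = max(docs, key=lambda d: score(terms, d))
    if score(terms, best) < min_score:
        return REFUSAL
    return best


# Hypothetical corpus with one OCR-style error ("rai1road").
docs = ["the rai1road strike ended on tuesday", "wheat prices rose sharply"]
print(answer("railway", docs))
print(answer("telegraph", docs))
```

With these sample documents, "railway" reaches the garbled "rai1road" line only because expansion adds "railroad" and fuzzy matching tolerates the single-character error. "telegraph" has no close match, so the gate returns the refusal string instead of the least-bad document.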

But there's a deeper reason expansion can't be the whole story, and it shows up when you ask what embeddings actually measure. Retrieval failures are architectural, not incremental: embeddings capture *association*, not *relevance*, and there are hard mathematical limits on which sets of documents an embedding of a given dimension can represent (see "Where do retrieval systems fail and why?"). Semantic expansion works inside the embedding space, so it inherits that space's blind spots. The most pointed example is the gap between surface similarity and what you actually want: a student asks about 'projection' after a specific remark, but the semantically nearest passage is the one about projection *matrices*, which is the wrong one (see "Why do queries and their causes seem semantically different?"). Expanding the query with more 'projection'-adjacent terms makes that confusion worse, not better, because the mismatch isn't lexical: it's about cause versus topic.
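A toy calculation makes the cause-versus-topic point concrete. The sketch below uses bag-of-words cosine similarity as a crude stand-in for embedding similarity; the two passages and the expansion terms are invented for illustration and do not come from the backtracing work.

```python
import math
from collections import Counter


def cosine(a_text, b_text):
    """Cosine similarity between bag-of-words count vectors."""
    a = Counter(a_text.lower().split())
    b = Counter(b_text.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical passages, invented for illustration.
cause = "notice the shadow this vector casts on the line"    # remark that prompted the question
topical = "projection matrices map vectors onto a subspace"  # on topic, but not the cause

query = "what is a projection"
expanded = query + " matrix subspace orthogonal"  # hypothetical expansion terms

for label, q in [("original", query), ("expanded", expanded)]:
    print(label, "topical:", round(cosine(q, topical), 3), "cause:", round(cosine(q, cause), 3))
```

In this toy setup, expansion raises the score of the topical passage while the passage that actually caused the question stays at zero, because it shares no vocabulary with the query or with any topic-adjacent term.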

The adjacent lesson the corpus offers is that the heavy lifting against mismatch may belong at the *verification* stage rather than the expansion stage. A two-stage pipeline that first recalls broadly, then runs a small learned verifier over full token-to-token similarity maps, reliably rejects 'structural near-misses': documents that look right in compressed vector form but fall apart under close token-level inspection (see "Can verification separate structural near-misses from topical matches?"). That's the same shape as grounded refusal: widen recall, then add a second, stricter gate. For corrupted text specifically, you can also attack the problem from the model side. Domain descriptions alone can be used to synthesize training data and adapt a retriever to a degraded or unusual corpus without ever touching the target collection (see "Can you adapt retrieval models without accessing target data?"), which addresses mismatch by teaching the retriever the corrupted domain's vocabulary directly instead of patching each query.
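The recall-then-verify shape can be illustrated with a toy example. This is a sketch under strong assumptions, not the pipeline from the cited note: exact token equality replaces learned token embeddings, and a hand-written order check (`order_preserved`, a hypothetical name) stands in for the small learned verifier. It only shows why a token-to-token map can expose a near-miss that a pooled vector hides.

```python
import math
from collections import Counter


def pooled_cosine(q_tokens, d_tokens):
    """Stage 1: cosine between pooled token-count vectors (word order is lost)."""
    q, d = Counter(q_tokens), Counter(d_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def similarity_map(q_tokens, d_tokens):
    """Token-to-token map: 1.0 where tokens are equal, else 0.0 (toy similarity)."""
    return [[1.0 if qt == dt else 0.0 for dt in d_tokens] for qt in q_tokens]


def order_preserved(sim_map):
    """Stage 2 stand-in: each query token's best match must lie to the right
    of the previous token's best match."""
    last = -1
    for row in sim_map:
        best = max(row, default=0.0)
        if best == 0.0:
            return False  # this query token has no match in the document
        j = row.index(best)
        if j <= last:
            return False  # match order is scrambled relative to the query
        last = j
    return True


def retrieve(query, docs, recall_threshold=0.5):
    """Recall broadly with pooled cosine, then keep only verified documents."""
    q = query.lower().split()
    recalled = [d for d in docs if pooled_cosine(q, d.lower().split()) >= recall_threshold]
    return [d for d in recalled if order_preserved(similarity_map(q, d.lower().split()))]


# Hypothetical documents: the first two have identical pooled vectors.
docs = ["dog bites man", "man bites dog", "wheat prices rose"]
print(retrieve("dog bites man", docs))
```

Both "dog bites man" and "man bites dog" pass the pooled-cosine stage with the same score, since pooling discards order. Only the token-level check separates them, which is the role the learned verifier plays in the pipeline described above.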

Two cautions are worth knowing. First, corruption isn't only an input problem you retrieve around: LLMs themselves silently corrupt about 25% of document content over long delegated workflows, with errors compounding rather than plateauing (see "Do frontier LLMs silently corrupt documents in long workflows?"). So a pipeline that expands queries to handle noisy sources can quietly manufacture new noise downstream. Second, expansion presumes the system knows what it's looking for, but models are strikingly bad at recognizing when text supports multiple readings: GPT-4 disambiguates only 32% of genuinely ambiguous cases versus 90% for humans (see "Can language models recognize when text is deliberately ambiguous?"). Corrupted text is ambiguous text, and a system that can't tell it's facing several plausible interpretations will expand confidently in one wrong direction. The takeaway you didn't come looking for: semantic expansion is real leverage against vocabulary mismatch, but every paper that makes it work pairs it with a downstream skeptic (a refusal rule, a verifier, an entailment check), because the same breadth that overcomes mismatch is also what lets corruption through.

Sources (7 notes)

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Where do retrieval systems fail and why?

RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.

Why do queries and their causes seem semantically different?

Backtracing—finding what caused a query—diverges from semantic similarity especially in conversation and lecture domains. Students ask about projection after hearing a specific statement, but the semantically closest passage discusses projection matrices instead, showing that surface similarity misses the actual cause.

Can verification separate structural near-misses from topical matches?

A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.

Can you adapt retrieval models without accessing target data?

Research demonstrates that a brief textual domain description suffices to generate synthetic training data for retrieval fine-tuning, outperforming baselines in zero-target-access scenarios and enabling adaptation where conventional methods are blocked.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Can language models recognize when text is deliberately ambiguous?

AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a retrieval-augmented generation researcher. The open question: can semantic query expansion reliably overcome vocabulary mismatch in corrupted text? A curated library of papers spanning 2023–2026 found:

**What a curated library found — and when (dated claims, not current truth):**
- Semantic expansion paired with grounded refusal (refusing to answer without verified evidence) rescues retrieval in noisy multilingual corpora, but expansion alone amplifies corruption (2024–2025).
- Embeddings capture association, not relevance; hard dimensional limits prevent disambiguating causal relevance from surface similarity—e.g., 'projection' queries retrieve matrix docs instead of contextual ones. Token-level verification gates can reject these structural near-misses (2024).
- Domain-adaptive retrieval via synthetic training data (no target collection access) can teach retrievers corrupted vocabulary without touching the source (2023).
- LLMs silently corrupt ~25% of document content in long delegated workflows; corruption compounds (2026).
- GPT-4 disambiguates only 32% of genuinely ambiguous cases vs. 90% for humans (2023); corrupted text is ambiguous, and confident wrong expansion follows.
- Newer methods (2025–2026) unify RAG and reasoning via RL (UR2), continuous latent reasoning (CLaRa), and compositional sensitivity training, reframing the problem beyond query expansion alone.

**Anchor papers (verify; mind their dates):**
- arXiv:2304.14399 (2023) – Ambiguity recognition limits in LLMs
- arXiv:2403.03956 (2024) – Backtracing retrieves causal relevance
- arXiv:2604.16351 (2026) – Compositional sensitivity and retrieval generalization
- arXiv:2508.06165 (2025) – UR2: RL-driven unified RAG & reasoning

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, assess whether (a) longer-context models (2024–2026), (b) reasoning-integrated RAG (UR2, CLaRa), (c) RL-trained retrieval, or (d) compositional sensitivity training have relaxed the embedding-space blind spot or the ambiguity-recognition ceiling. Does expansion still fail on causal relevance vs. surface similarity? Has the ~25% corruption rate been independently replicated or mitigated? Separate durable (probably still open: how do you expand when the query itself is ambiguous?) from perishable (may be solved by RL or compositional training).
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Do recent RL-based or reasoning-aware approaches sidestep expansion altogether, or do they still need it as a first stage?
(3) **Propose 2 research questions** that assume the regime has moved: (a) Can compositional sensitivity in retrievers enable *adaptive* expansion—i.e., the system expands only when ambiguity is detected? (b) If RL can align retrieval and reasoning jointly, does it solve the causal-vs.-semantic gap without post-hoc verification gates?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
