SYNTHESIS NOTE

Do retrieval models actually follow natural language instructions?

Most IR systems ignore instructions that define relevance, despite using LLM backbones. This raises questions about whether retrievers can adapt to nuanced user-specified information needs in practice.

Synthesis note · 2026-06-03 · sourced from Self Refinement Self Consistency Feedback

LLMs follow long, complex instructions, and IR models increasingly use LLM backbones, yet nearly all retrievers still take only a query as input, with no instruction defining what relevance means for the task at hand. FollowIR builds a benchmark on the TREC tradition, in which human annotators receive narratives (detailed instructions) for judging relevance: it alters those annotator instructions, re-annotates relevance, and then measures whether IR models adjust their relevance decisions accordingly. The finding is that nearly all retrieval models fail to follow instructions; the only exceptions are very large (3B+) or instruction-tuned LLMs, which are not typically used for retrieval. The behavior is learnable, however: a training corpus teaches instruction-following, and FollowIR-7B improves on both standard retrieval metrics and instruction-following.
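To make the measurement idea concrete, here is a minimal sketch of how one might score whether a ranker reacts to an altered instruction. It is an illustration under stated assumptions, not necessarily the metric the benchmark itself uses: the function names `rank_shift` and `instruction_sensitivity`, the dictionary inputs, and the scoring formula are all hypothetical. The score looks only at documents that the altered instruction makes non-relevant and asks whether they moved down the ranking.

```python
from typing import Dict, Iterable


def rank_shift(rank_original: int, rank_altered: int) -> float:
    """Signed score in (-1, 1) for one document (hypothetical formula).

    Positive: the document moved down after the instruction changed.
    Negative: it moved up. Zero: its rank did not change.
    Ranks are 1-indexed (1 = top of the list).
    """
    if rank_original < 1 or rank_altered < 1:
        raise ValueError("ranks must be 1-indexed positive integers")
    if rank_altered >= rank_original:
        # Moved down (or stayed put): fraction of the way toward the bottom.
        return 1.0 - rank_original / rank_altered
    # Moved up: negative score of the same relative size.
    return rank_altered / rank_original - 1.0


def instruction_sensitivity(
    ranks_original: Dict[str, int],
    ranks_altered: Dict[str, int],
    newly_irrelevant: Iterable[str],
) -> float:
    """Average rank_shift over documents the altered instruction excludes.

    A ranker that ignores instructions produces the same ranking twice
    and therefore scores 0.0.
    """
    scores = [
        rank_shift(ranks_original[doc_id], ranks_altered[doc_id])
        for doc_id in newly_irrelevant
    ]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy data: d1 and d2 become non-relevant under the altered instruction.
    before = {"d1": 2, "d2": 1, "d3": 3}
    after = {"d1": 10, "d2": 4, "d3": 1}
    print(instruction_sensitivity(before, after, ["d1", "d2"]))
```

In this toy setup a positive average means the ranker demoted the documents that the new instruction rules out, while a value near zero is what an instruction-blind retriever would produce.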

The takeaway is that retrieval is stuck in an ad-hoc keyword paradigm while the rest of NLP has moved to flexible instructions: relevance is treated as a fixed property of the query rather than something an instruction can redefine on the fly. Closing that gap would let users specify complex information needs in natural language.

This connects the vault's retrieval thread to instruction-following. It complements "Can question features alone predict when to retrieve?", which covers when to retrieve, by addressing what counts as relevant. Both are limits of the query-only retrieval paradigm that the RAG-gap note diagnoses.

Related concepts in this collection 2

Can question features alone predict when to retrieve?
Can lightweight external features of a question, rather than expensive model uncertainty checks, reliably decide whether retrieval is needed? This matters because uncertainty-based methods promise efficiency but add computation.
when-to-retrieve vs what-counts-as-relevant; both limits of query-only retrieval

Why does retrieval-augmented generation fail in production?
RAG systems work in controlled demos but break in real-world deployment, especially for high-stakes domains like medicine and finance. Understanding the three structural failure modes reveals why.
instruction-blind relevance is part of why production RAG underperforms
