Do retrieval models actually follow natural language instructions?
Most IR systems ignore instructions that define relevance, despite using LLM backbones. This raises questions about whether retrievers can adapt to nuanced user-specified information needs in practice.
LLMs follow long, complex instructions, and IR models increasingly use LLM backbones. Yet nearly all retrievers still take only a query as input, with no instruction defining what relevance means for the task at hand. FollowIR builds a benchmark on the TREC tradition, in which human annotators receive narratives (detailed instructions) for judging relevance. It alters those annotator instructions, re-annotates the documents, and then measures whether IR models adjust their relevance decisions accordingly. The finding: nearly all retrieval models fail to follow instructions; the only exceptions are very large models (3B+ parameters) or instruction-tuned LLMs that are not typically used for retrieval. The behavior is learnable, though: a dedicated training corpus teaches instruction-following, and the resulting FollowIR-7B improves on both standard retrieval metrics and instruction-following.
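A minimal sketch of how such a before/after comparison could be scored. It assumes we already have each document's rank under the original instruction and under the altered one, and that every listed document became non-relevant after the alteration, so moving down the ranking is the correct response. The function names, the example ranks, and the scoring formula are illustrative assumptions, not FollowIR's published metric or code.

```python
from statistics import mean


def rank_shift_score(rank_before: int, rank_after: int) -> float:
    """Score one document's rank change in [-1, 1].

    Ranks are 1-based. The document is assumed to have become NON-relevant
    under the altered instruction, so moving down the ranking is correct.
    """
    if rank_before < 1 or rank_after < 1:
        raise ValueError("ranks are 1-based and must be >= 1")
    if rank_before > rank_after:
        # Document moved up although it should have dropped: negative score.
        return rank_after / rank_before - 1.0
    # Document stayed put (0.0) or moved down (positive score).
    return 1.0 - rank_before / rank_after


def instruction_sensitivity(rank_pairs: list[tuple[int, int]]) -> float:
    """Average the per-document scores for one query."""
    return mean(rank_shift_score(before, after) for before, after in rank_pairs)


# Hypothetical (before, after) ranks for three documents that the altered
# instruction excludes. An instruction-blind retriever leaves them in place.
blind = [(1, 1), (3, 3), (7, 7)]
follower = [(1, 10), (3, 30), (7, 70)]

print(instruction_sensitivity(blind))
print(instruction_sensitivity(follower))
```

A score near zero means the ranking did not react to the changed instruction, which is the behavior the note attributes to most retrievers. A clearly positive score means the newly excluded documents were pushed down.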
The takeaway is that retrieval is stuck in an ad-hoc keyword paradigm while the rest of NLP has moved to flexible instructions: relevance is treated as a fixed property of the query rather than something an instruction can redefine on the fly. Closing that gap would let users specify complex information needs in natural language.
This connects the vault's retrieval thread to instruction-following. It complements "Can question features alone predict when to retrieve?", which covers when to retrieve, by addressing what counts as relevant. Both are limits of the query-only retrieval paradigm that the RAG-gap note diagnoses.
Related concepts in this collection
- Can question features alone predict when to retrieve?
  Can lightweight external features of a question, rather than expensive model uncertainty checks, reliably decide whether retrieval is needed? This matters because uncertainty-based methods promise efficiency but add computation.
  when-to-retrieve vs what-counts-as-relevant; both limits of query-only retrieval
- Why does retrieval-augmented generation fail in production?
  RAG systems work in controlled demos but break in real-world deployment, especially for high-stakes domains like medicine and finance. Understanding the three structural failure modes reveals why.
  instruction-blind relevance is part of why production RAG underperforms
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
- On the Theoretical Limitations of Embedding-Based Retrieval
- Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning
- Do Prompt-Based Models Really Understand the Meaning of Their Prompts?
- Instruction Tuning for Large Language Models: A Survey
- Answer is All You Need: Instruction-following Text Embedding via Answering the Question
- Exploring Format Consistency for Instruction Tuning
- Rethinking with Retrieval: Faithful Large Language Model Inference
Original note title
retrieval models do not follow natural-language instructions defining relevance — only very large or instruction-tuned ones do