Does RL pruning of documents differ fundamentally from rationale-driven evidence selection?
This explores whether two ways of trimming what a model reads — pruning learned from a reward signal versus selecting evidence by explicit LLM-written reasons — are doing fundamentally the same thing or two different things.
The corpus suggests that the difference between RL-style document pruning and rationale-driven evidence selection is real, and that it comes down to *what carries the signal*: one optimizes against an outcome, the other against an explanation. In rationale-driven selection, the model writes a reason for keeping each chunk before keeping it. METEORA does exactly this: LLM-generated rationales with flagging instructions pick the evidence, yielding 33% better accuracy than similarity re-ranking while using half the chunks, and the rationale layer also makes the system harder to fool adversarially (see: Can rationale-driven selection beat similarity re-ranking for evidence?). The justification is the artifact: you can read why something survived.
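As a rough illustration of what it means for the justification to be the artifact, the sketch below keeps a chunk only when a judge returns a written reason alongside a keep flag. It is not METEORA's pipeline: `Verdict`, `select_with_rationales`, and the keyword-based `toy_judge` (a stand-in for an LLM call) are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    keep: bool       # the flag: should this chunk survive?
    rationale: str   # the written reason, stored next to the decision


# A judge maps (query, chunk) to a Verdict. In a real system this would be
# an LLM call; here it is a hypothetical stand-in.
Judge = Callable[[str, str], Verdict]


def select_with_rationales(query: str, chunks: List[str], judge: Judge) -> List[Tuple[str, str]]:
    """Return (chunk, rationale) pairs for every chunk the judge flags to keep."""
    selected = []
    for chunk in chunks:
        verdict = judge(query, chunk)
        if verdict.keep:
            selected.append((chunk, verdict.rationale))
    return selected


def toy_judge(query: str, chunk: str) -> Verdict:
    """Toy judge (assumption): keep a chunk if it shares a word with the query."""
    shared = set(query.lower().split()) & set(chunk.lower().split())
    if shared:
        return Verdict(True, f"mentions query terms: {sorted(shared)}")
    return Verdict(False, "no overlap with the query")


chunks = [
    "the lease terminates on breach of payment",
    "our office has a nice view",
]
kept = select_with_rationales("when does the lease terminate", chunks, toy_judge)
for chunk, why in kept:
    print(f"KEEP: {chunk!r} because {why}")
```

The point of the design is that every surviving chunk carries its reason with it, so the selection can be audited after the fact rather than inferred from a score.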
Reward- or likelihood-driven pruning works the other way: it keeps whatever a learned signal says is useful downstream, with no obligation to explain itself. The cleanest example here is token-level pruning that ranks reasoning-chain tokens by *functional importance*: symbolic computation tokens are preserved, while grammar and meta-discourse are cut first, purely because that's what keeps the likelihood (and downstream student performance) intact (see: Which tokens in reasoning chains actually matter most?). There's no rationale, only a measured effect on the output. Chain of Draft lands in the same family from the generation side: it matches standard chain-of-thought accuracy with only 7.6% of the tokens, which implies the other 92.4% served style and documentation rather than computation (see: Can minimal reasoning chains match full explanations?). Both are 'keep what matters' methods where 'matters' is defined by impact on the answer, not by an argument.
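A minimal sketch of greedy, score-preserving pruning, under loud assumptions: `toy_log_likelihood` is a made-up stand-in for a real model's log-likelihood of the answer given the remaining chain, and the token weights are invented for illustration. The loop only shows the mechanism: repeatedly drop whichever token costs the least measured likelihood.

```python
from typing import Callable, List

# Hypothetical per-token contribution to the answer's log-likelihood.
# In practice this would come from scoring with a language model.
TOY_WEIGHTS = {"3": 2.0, "*": 2.0, "4": 2.0, "=": 1.5, "12": 3.0}


def toy_log_likelihood(tokens: List[str]) -> float:
    """Toy scorer (assumption): sum of weights; unknown tokens add 0.1."""
    return sum(TOY_WEIGHTS.get(tok, 0.1) for tok in tokens)


def greedy_prune(tokens: List[str], score: Callable[[List[str]], float], budget: int) -> List[str]:
    """Remove tokens one at a time, always dropping the one whose removal
    lowers the score least, until only `budget` tokens remain."""
    kept = list(tokens)
    while len(kept) > budget:
        # Score every single-token deletion and pick the least harmful one.
        best_idx = max(range(len(kept)), key=lambda i: score(kept[:i] + kept[i + 1:]))
        del kept[best_idx]
    return kept


chain = "so now we compute 3 * 4 = 12 obviously".split()
pruned = greedy_prune(chain, toy_log_likelihood, budget=5)
print(pruned)
```

With these toy weights the filler words go first and the arithmetic tokens survive, and nothing in the loop ever produces a reason for a deletion, only a score difference.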
The interesting middle case is StructRAG, which trains a router with DPO, a reinforcement-learning-flavored preference objective, to pick which knowledge structure (table, graph, algorithm, catalogue, or chunk) fits a query (see: Can routing queries to task-matched structures improve RAG reasoning?). This is learned-from-preference selection, closer to the pruning camp in *mechanism* (it optimizes a routing policy against outcomes) but closer to the rationale camp in *spirit* (it chooses based on task demands). It shows that the two approaches are less a clean binary than a spectrum from opaque-but-effective to interpretable-but-LLM-dependent.
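For readers unfamiliar with DPO, the sketch below computes the standard DPO loss for a single preference pair, framed here as a router preferring one structure type over another. The log-probability values and the choice of "table" versus "chunk" are illustrative assumptions, not StructRAG's training code or data.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) is equal to log(1 + exp(-x)).
    return math.log1p(math.exp(-logits))


# Hypothetical log-probabilities the router assigns to "table" (chosen) and
# "chunk" (rejected) for one query, under the policy and a frozen reference.
before = dpo_loss(-1.2, -1.0, -1.2, -1.0)  # policy identical to the reference
after = dpo_loss(-0.4, -2.0, -1.2, -1.0)   # policy now favors the chosen structure
print(before, after)
```

The loss falls as the policy shifts probability toward the preferred structure relative to the reference, which is the sense in which the router is optimized against outcomes without ever stating a reason.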
The distinction matters for more than tidiness. Rationale-driven selection buys you auditability and robustness, and that turns out to be load-bearing, because trust signals in retrieval are easily gamed: users prefer answers with *more* citations even when the extra citations are irrelevant, treating count as a proxy for quality (see: Do users trust citations more when there are simply more of them?). A method that can state *why* a document was kept is the natural defense against that decoupling. Reward-driven pruning, by contrast, optimizes the thing you can measure, which is great until the measurement and the goal diverge. That is the same failure that makes grounded refusal necessary when sources are noisy: the system constrains generation to evidence-backed claims rather than trusting a retrieval score (see: Can RAG systems refuse to answer without reliable evidence?).
The twist worth taking away: 'selection' may be the wrong frame for both. Work on procedural knowledge in pretraining shows that reasoning generalizes from *broad, transferable* patterns spread across many documents, while factual recall depends on narrow, document-specific memorization (see: Does procedural knowledge drive reasoning more than factual retrieval?), and models can reconstruct information never stated in any single document by piecing together scattered hints (see: Can LLMs reconstruct censored knowledge from scattered training hints?). If the useful signal is distributed rather than localized in specific chunks, then both rationale-flagging *and* reward-pruning are operating on the wrong unit. The deeper question isn't which pruning method wins; it's whether picking documents at all is the right move when knowledge doesn't live in documents one at a time.
Sources (8 notes)
METEORA uses LLM-generated rationales with flagging instructions to select evidence, achieving 33% better accuracy with 50% fewer chunks than similarity re-ranking across legal, financial, and academic domains. The method also improves adversarial robustness substantially.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router that chooses among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load theory and cognitive fit theory from cognitive science.
Analysis of 24,000 Search Arena interactions shows irrelevant citations boost user preference (β=0.273) nearly as much as relevant citations (β=0.285), indicating citation count functions as a decoupled trust heuristic.
A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall, which depends on narrow, document-specific memorization of target facts.
Language models perform out-of-context reasoning across the full training distribution, reconstructing information never explicitly stated in any single document. Experiments show models can infer city identities from scattered distance relationships and apply them downstream without in-context learning.