Can we cheaply estimate which samples are currently most informative?
This explores whether there's a low-cost way to figure out which training examples (or which questions to ask) will teach a model the most *right now* — given that informativeness keeps shifting as the model learns.
The question can be read two ways at once: which *training samples* are worth learning from, and which *queries* are worth asking. The corpus suggests both hinge on a single uncomfortable fact: informativeness isn't a property of the sample, it's a relationship between the sample and the model's current state. The sharpest statement of this is that a sample's learning value depends on the interaction between its difficulty and the model's present ability, so the 'productive band' of medium-difficulty examples drifts during training and any static difficulty score goes stale within steps (How does model ability change what samples teach?). That's the bad news for cheap estimation: whatever you measure, you have to keep re-measuring.
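To make the drifting band concrete, here is a minimal sketch of re-scoring samples against the model's current ability. It is not taken from any of the cited papers: the pass-rate thresholds (`low`, `high`), the rollout count, and the toy solver `make_toy_solver` are all illustrative assumptions, with the solver standing in for "run the current model on this sample and check the answer".

```python
import random

def pass_rate(solve, sample, n_rollouts=8):
    """Fraction of n_rollouts attempts on `sample` that succeed right now."""
    return sum(solve(sample) for _ in range(n_rollouts)) / n_rollouts

def productive_band(samples, solve, low=0.2, high=0.8, n_rollouts=8):
    """Keep samples whose current pass rate lies strictly inside (low, high)."""
    return [s for s in samples if low < pass_rate(solve, s, n_rollouts) < high]

def make_toy_solver(ability, rng):
    """Hypothetical stand-in for a model: success gets less likely as
    difficulty exceeds ability. Not a real training loop."""
    def solve(difficulty):
        p = min(1.0, max(0.0, 0.5 + (ability - difficulty)))
        return rng.random() < p
    return solve

rng = random.Random(0)
difficulties = [round(0.1 * i, 1) for i in range(11)]  # 0.0, 0.1, ..., 1.0

# Same samples, scored at two points in "training" (two ability levels).
early = productive_band(difficulties, make_toy_solver(0.3, rng), n_rollouts=200)
late = productive_band(difficulties, make_toy_solver(0.9, rng), n_rollouts=200)
print("early band:", early)
print("late band: ", late)
```

The same pool yields a different band at each ability level, which is the point: the score has to be recomputed as the model moves, so the estimator's cost per refresh matters more than its one-off accuracy.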
The good news is that several cheap proxies work surprisingly well. Gradient-based influence estimation (LESS) uses low-rank gradient features to pick the 5% of instruction data most similar to a target capability, and training on that slice beats training on everything, partly because the discarded data was actively dragging reasoning in the wrong direction (Can we train better models on less data?). A related insight is that you don't always need an external estimator at all: as a 'should I retrieve?' signal, a model's own calibrated token-probability uncertainty beats elaborate multi-call heuristics on single-hop tasks and matches them on multi-hop, at a fraction of the LM and retriever calls (Can simple uncertainty estimates beat complex adaptive retrieval?). The self-knowledge is already there; you just have to read it cheaply.
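A rough sketch of the gradient-feature idea, under stated assumptions: per-example gradients are already available as plain vectors, a Gaussian random projection stands in for the "low-rank gradient features", and cosine similarity to a target-task gradient is the score. This is a simplification in the spirit of the description above, not the LESS implementation; every function name here is hypothetical.

```python
import math
import random

def random_projection(dim_in, dim_out, rng):
    """Gaussian projection matrix (dim_out x dim_in), scaled by 1/sqrt(dim_out)."""
    scale = 1.0 / math.sqrt(dim_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(dim_in)]
            for _ in range(dim_out)]

def project(matrix, vec):
    """Map a full gradient vector to its low-dimensional feature."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def select_top_fraction(train_grads, target_grad, fraction=0.05, proj_dim=16, seed=0):
    """Return (indices of the top `fraction` of examples, all scores).
    Score = cosine similarity to the target gradient in projected space."""
    rng = random.Random(seed)
    proj = random_projection(len(target_grad), proj_dim, rng)
    target_feat = project(proj, target_grad)
    scores = [cosine(project(proj, g), target_feat) for g in train_grads]
    k = max(1, int(len(train_grads) * fraction))
    ranked = sorted(range(len(train_grads)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

# Toy data (assumption): 40 random "gradients", two of which nearly match the target.
rng = random.Random(1)
dim = 64
target = [rng.gauss(0, 1) for _ in range(dim)]
pool = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(40)]
for i in (3, 17):
    pool[i] = [t + 0.01 * rng.gauss(0, 1) for t in target]

selected, scores = select_top_fraction(pool, target, fraction=0.05)
print(sorted(selected))
```

Examples with negative scores are the ones whose gradients point away from the target capability, which is one way to picture data that "actively drags" training in the wrong direction.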
On the query side, the same logic appears as active selection. Information-gain simulation (UoT) scores candidate questions by how much their possible answers would shrink uncertainty, picking the genuinely high-value question instead of a generic one (How can models select the most informative question to ask?). PReF pushes this to an extreme: ten adaptively chosen questions are enough to pin down a personalized reward, because each is selected to maximally reduce coefficient uncertainty (Can user preferences be learned from just ten questions?). Both are 'cheap' precisely because they refuse to ask everything and instead spend the budget where uncertainty is highest. The bandit literature names the underlying tradeoff directly: explore uncertain options, exploit proven ones, and concentrate computation only on the *epistemic* uncertainty that decisions actually turn on rather than irreducible noise (Can neural networks explore efficiently at recommendation scale?; Can bandit algorithms beat collaborative filtering for news?).
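The scoring step behind "which question shrinks uncertainty most" can be sketched as one-step expected information gain over a discrete set of hypotheses. This covers only the entropy-reduction core; the scenario simulation and reward propagation that UoT adds are left out. The hypothesis names, the question functions, and the assumption that each hypothesis gives a deterministic answer are all illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_information_gain(prior, answer_fn):
    """prior: dict hypothesis -> probability (summing to 1).
    answer_fn(h): the answer we would hear if hypothesis h were true
    (assumed deterministic here)."""
    # Group prior mass by the answer each hypothesis would produce.
    by_answer = {}
    for h, p in prior.items():
        by_answer.setdefault(answer_fn(h), []).append(p)
    expected_posterior = 0.0
    for masses in by_answer.values():
        p_answer = sum(masses)
        if p_answer == 0:
            continue
        posterior = [m / p_answer for m in masses]
        expected_posterior += p_answer * entropy(posterior)
    return entropy(prior.values()) - expected_posterior

def best_question(prior, questions):
    """questions: dict name -> answer_fn. Returns the name with the highest gain."""
    return max(questions, key=lambda q: expected_information_gain(prior, questions[q]))

# Toy diagnosis (assumption): four equally likely conditions.
prior = {"flu": 0.25, "cold": 0.25, "allergy": 0.25, "covid": 0.25}
questions = {
    "fever?": lambda h: h in ("flu", "covid"),    # splits 2 / 2
    "itchy eyes?": lambda h: h == "allergy",      # splits 1 / 3
    "feeling unwell?": lambda h: True,            # generic: same answer for all
}
gains = {q: expected_information_gain(prior, fn) for q, fn in questions.items()}
choice = best_question(prior, questions)
print(gains, choice)
```

The generic question scores zero because every hypothesis answers it the same way, which is the formal version of "spend the budget where uncertainty is highest".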
There's a quieter thread worth pulling: sometimes the cheapest informativeness signal is *local and partial*. Step-level confidence catches a reasoning breakdown that whole-trace averaging hides, letting you discard a bad trace before it even finishes generating (Does step-level confidence outperform global averaging for trace filtering?). And a model's own half-formed answer can reveal an information gap the original query never expressed, so the partial response becomes the next retrieval query (Can a model's partial response guide what to retrieve next?). Informativeness, in other words, can be estimated mid-stream, not just before you start.
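A small sketch of why local confidence can catch what a global average hides. It assumes each generated step already has a confidence value in [0, 1] (for example, derived from token probabilities); the window size and threshold are arbitrary illustrative choices, not values from the cited work.

```python
def global_confidence(confidences):
    """Whole-trace average: one bad stretch gets diluted by the rest."""
    return sum(confidences) / len(confidences)

def lowest_window_confidence(confidences, window=4):
    """Mean confidence of the weakest contiguous window of `window` steps."""
    n = len(confidences)
    if n <= window:
        return global_confidence(confidences)
    running = sum(confidences[:window])
    lowest = running
    for i in range(window, n):
        running += confidences[i] - confidences[i - window]
        lowest = min(lowest, running)
    return lowest / window

def early_stop_index(confidence_stream, window=4, threshold=0.5):
    """Consume confidences as they arrive; return the step index at which the
    trailing-window mean first drops below threshold, or None if it never does."""
    recent = []
    for i, c in enumerate(confidence_stream):
        recent.append(c)
        if len(recent) > window:
            recent.pop(0)
        if len(recent) == window and sum(recent) / window < threshold:
            return i
    return None

# Toy trace (assumption): confident, then a four-step breakdown, then confident again.
trace = [0.9] * 10 + [0.2] * 4 + [0.9] * 10
overall = global_confidence(trace)
weakest = lowest_window_confidence(trace)
stop_at = early_stop_index(trace)
print(round(overall, 3), round(weakest, 3), stop_at)
```

The global average of this trace still looks healthy, while the windowed minimum exposes the breakdown, and the streaming check flags it about halfway through generation, so the rest of the trace never needs to be produced.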
The thing you might not have expected: across these papers, 'cheap' and 'better' stop being a tradeoff. Curating 78 demonstrations beats ten thousand (Can careful selection of 78 demos outperform massive training datasets?); 5% of data beats 100%; ten adaptive questions are enough to personalize a reward. The corpus keeps finding that aggressive, uncertainty-guided selection isn't a budget compromise. It outperforms abundance, because many samples are noise or actively harmful, and the cost of estimating informativeness is far smaller than the cost of learning from the wrong things.
Sources (10 notes)
A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.
LESS uses low-rank gradient features to select instruction data most similar to target capabilities, and training on the selected 5% consistently outperforms full dataset training. The improvement occurs because mixed datasets contain examples that actively hinder specific skills by shifting reasoning strategy away from task requirements.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
UoT combines uncertainty-aware scenario simulation with information-gain scoring and reward propagation to identify questions whose possible answers maximally reduce diagnostic uncertainty—providing a principled mechanism for specific, high-value clarification rather than generic prompts.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Outputs can then be personalized per user via inference-time reward alignment, without weight modification.
ENR separates aleatoric from epistemic uncertainty, focusing computation only on parameter uncertainty needed for Thompson sampling. It improved click-through rates 9% and ratings 6% while requiring 29% fewer interactions than baselines.
LinUCB frames news recommendation as a contextual bandit problem, explicitly balancing exploration of uncertain articles against exploitation of proven ones. The approach handles dynamic content and cold-start users better than traditional CF, with proven regret bounds and lower computational overhead.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.
ITER-RETGEN shows that iteratively using generated responses as retrieval queries substantially improves performance on multi-hop reasoning and fact verification. Generation acts as both answer producer and information-need clarifier, surfacing implicit gaps that the original query missed.
LIMI achieves 73.5% on AgencyBench using only 78 curated multi-turn trajectories, outperforming models trained on 10,000+ samples by 53.7%. Complete interaction sequences capturing tool use and reasoning appear to activate latent agentic patterns already present in pretrained models.