Does selecting examples from multiple complexity levels outperform selecting only high-quality examples?

This explores whether spreading training examples across a range of difficulty levels beats simply picking the 'best' or hardest examples — and the corpus reframes the question by showing that 'high quality' only means something relative to what the learner can absorb.

The collection's sharpest move is to challenge what 'high quality' even means. Several notes converge on the same surprising idea: quality is not an absolute property of an example but a relationship between the example and the learner. Teacher-refined data that is objectively better can actively *degrade* a student model when it sits beyond the student's learning frontier, so the fix is for students to filter refinements against their own statistical profile and keep only what is compatible (see "Does teacher-refined data always improve student model performance?"). Push this to the extreme and you get the failure case for 'just pick the hardest, richest examples': training on near-impossible problems causes models to learn degenerate shortcuts, such as answer repetition and skipped computation, that then contaminate skills the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). So selecting purely for difficulty or 'quality' can be worse than worthless.

Sources 7 notes

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
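One way to picture the filtering step is the sketch below. It is an illustration under stated assumptions, not the method from the note: the "statistical profile" is assumed here to be the student's own mean per-token log-probability, and the `Pair` type, the `select_training_text` helper, and the `max_drop` threshold are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """Hypothetical record: student scores for one original/refined example pair."""
    original_logprob: float  # student's mean per-token log-prob on the original text
    refined_logprob: float   # student's mean per-token log-prob on the teacher's refinement


def select_training_text(pairs, max_drop=0.5):
    """Keep a refinement only if the student finds it not much less likely
    than the original. `max_drop` is an assumed tolerance, not a published value."""
    choices = []
    for p in pairs:
        drop = p.original_logprob - p.refined_logprob
        # A large drop suggests the refinement sits beyond what the student can absorb.
        choices.append("refined" if drop <= max_drop else "original")
    return choices


choices = select_training_text([Pair(-1.0, -1.2), Pair(-1.0, -2.5)])
```

In this toy run the first refinement is kept (the student's score barely moves) while the second falls back to the original text, because the student assigns it a much lower likelihood.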

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
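The effect of group-relative normalization on rare successes can be seen with a little arithmetic. The sketch below assumes the common formulation in which each rollout's advantage is its reward minus the group mean, divided by the group standard deviation; the function name, the `eps` term, the use of the population standard deviation, and the binary reward groups are assumptions for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.
    Assumes population standard deviation; implementations may differ."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Near-impossible problem: one accidental success among eight rollouts.
hard_group = [0, 0, 0, 0, 0, 0, 0, 1]
# Moderately hard problem: half of the rollouts succeed.
balanced_group = [1, 0, 1, 0, 1, 0, 1, 0]

hard_adv = group_relative_advantages(hard_group)
balanced_adv = group_relative_advantages(balanced_group)
```

Under these assumptions the lone success in the hard group receives an advantage of about 2.65, against about 1.0 for each success in the balanced group. Whatever that single trajectory happened to do, including a shortcut that landed on the right answer by chance, is reinforced more strongly than any individual success on the tractable problem.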

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.

Can careful curation replace massive alignment datasets?

LIMA demonstrates that fine-tuning a strong pretrained model on 1,000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.

Can careful selection of 78 demos outperform massive training datasets?

LIMI achieves 73.5% on AgencyBench using only 78 curated multi-turn trajectories, outperforming models trained on 10,000+ samples by 53.7%. Complete interaction sequences capturing tool use and reasoning appear to activate latent agentic patterns already present in pretrained models.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
