Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does self-generated training data improve model learning?

Can models learn more effectively from training data they generate themselves than from data created by external sources? This note explores whether a learner's own restructuring process produces better learning outcomes than consuming material structured by someone else.

Note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

SEAL (Self-Adapting Language Models) equips LLMs with the ability to generate "self-edits" — natural-language instructions that specify both the training data and optimization hyperparameters for updating the model's own weights. Given new factual knowledge to incorporate, instead of finetuning directly on the source text, the model generates its own synthetic training data optimized for self-learning.
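
To make the mechanism concrete, here is a minimal sketch of a SEAL-style self-edit for knowledge incorporation. The model choice, prompt wording, and hyperparameter values are illustrative assumptions, not the paper's code; in SEAL the hyperparameters are themselves generated by the model as part of the self-edit.

```python
# A sketch of a SEAL-style self-edit, assuming Hugging Face transformers.
# Prompt wording and hyperparameter values are illustrative, not from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # any instruct model works; arbitrary choice
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def self_edit(passage: str) -> dict:
    """The model restructures new factual text into its own training data."""
    prompt = ("List the implications, restatements, and logical consequences "
              "of the following passage, one per line:\n\n" + passage)
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=512, do_sample=True)
    synthetic = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
    return {
        "train_data": [line for line in synthetic.splitlines() if line.strip()],
        # In SEAL these are generated by the model itself; fixed here for brevity.
        "hparams": {"lr": 1e-4, "epochs": 3, "lora_rank": 16},
    }
```

An inner-loop update then finetunes a lightweight adapter (e.g. LoRA) on `train_data` with those hyperparameters; any standard SFT routine stands in for that step.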

The results are counter-intuitive: finetuning on self-generated data improves no-passage QA performance from 33.5% to 47.0%, outperforming data generated by GPT-4.1. A weaker model's self-generated data produces better learning outcomes than a stronger model's externally generated data.

The analogy to human learning is precise: students who rewrite lecture material in their own words consistently outperform students who study the original text. The restructuring process is itself the learning — it forces the learner to identify gaps, reframe concepts in familiar terms, and create connections to existing knowledge. Different learners restructure differently (visual diagrams, text summaries, mathematical formulations) because the optimal transformation depends on the learner's representational structure, not just the content.

For LLMs, this means the model's own distributional characteristics determine what data format will produce effective weight updates. A model with particular learned representations will learn more from data that aligns with those representations than from data optimized for a different model's internal structure.
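
One crude way to operationalize "alignment with the model's own representations" (our gloss, not a method from the paper) is to score candidate training texts by the loss the current model assigns them: data the model could almost have written itself sits closer to its distribution. Reusing `model` and `tok` from the sketch above:

```python
# Illustrative heuristic only: compare the per-token loss the current model
# assigns to the same facts in different formats. Lower loss suggests the
# format is closer to the model's own distribution. Not SEAL's method.
import torch

def per_token_loss(model, tok, text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # causal LM loss, averaged over tokens
    return out.loss.item()

# e.g. compare per_token_loss(model, tok, source_passage)
#      with per_token_loss(model, tok, model_authored_restatement)
```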

The method uses RL to train the self-edit capability: the downstream performance of the updated model serves as the reward signal. This means the model learns not just what to study but how to study — selecting augmentation strategies and optimization hyperparameters alongside content.
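
The outer loop has a simple shape: sample candidate self-edits, apply each one, and reinforce the ones that improve downstream performance (the paper implements this reward-filtered scheme with ReST-EM-style behavior cloning). The sketch below reuses `self_edit` from above; `evaluate` (downstream QA accuracy), `apply_self_edit` (the inner-loop weight update), and `sft` are hypothetical helpers standing in for real implementations.

```python
# Shape of the outer RL loop: sample self-edits, apply them, keep the ones
# whose inner-loop update improves downstream performance, then finetune
# the self-edit policy on the winners. Helper functions are hypothetical.
def outer_loop(model, tasks, n_samples=4, n_rounds=3):
    for _ in range(n_rounds):
        kept = []
        for task in tasks:
            base = evaluate(model, task)                # score before updating
            for _ in range(n_samples):
                edit = self_edit(task.context)          # sample a candidate self-edit
                updated = apply_self_edit(model, edit)  # inner-loop weight update
                if evaluate(updated, task) > base:      # reward = downstream gain
                    kept.append((task.context, edit))
        model = sft(model, kept)  # reinforce only the self-edits that helped
    return model
```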

On a simplified ARC-AGI subset, SEAL also outperforms both standard in-context learning and self-editing without RL training, showing that the quality of self-generated data improves with the RL-trained meta-learning capability.

Two converging methods from alignment research reinforce the self-generation principle. First, instruction backtranslation (Humpback) trains an LLM to generate instructions for unlabeled web text, then self-selects high-quality pairs through iterative curation: the model generates its own training signal and curates it. Second, MAGPIE (Can aligned LLMs generate their own training data?) shows that aligned models can generate 4 million instruction-response pairs from their pre-query template alone, outperforming human-curated datasets. Both methods demonstrate the same principle at different levels: self-generated data captures the model's own distributional preferences, producing more learnable training signal than external generation.
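
The MAGPIE trick is simple enough to sketch directly. Assuming a Llama-3-Instruct model served through transformers (the template string below follows that family's chat format; sampling settings are arbitrary):

```python
# MAGPIE-style self-generation: prompt an aligned model with only the
# pre-query chat template and let it sample a plausible user instruction.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Everything up to where a user message would begin, and nothing more.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
ids = tok(pre_query, return_tensors="pt", add_special_tokens=False).input_ids
out = model.generate(ids, max_new_tokens=128, do_sample=True, temperature=1.0)
instruction = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

# A second pass with the full chat template wrapped around `instruction`
# yields the paired response: an instruction-response pair from no seed data.
```

Because the instruction distribution comes from the model's own alignment training, the resulting pairs are by construction in-distribution for that model.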

This extends Does training data format shape reasoning strategy more than domain? to training data generation: not just the format of training data matters, but who generates it. And it provides a constructive mechanism for Does teacher-refined data always improve student model performance? — the ideal "teacher" for data refinement is the student model itself.


Source: Self Refinement Self Consistency Feedback — SEAL: Self-Adapting Language Models (arxiv 2506.10943); enriched from Alignment

Original note title: self-generated training data outperforms externally generated data for knowledge incorporation because model-specific restructuring matches the learner's representational needs