Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does self-generated training data improve model learning?

Can models learn more effectively from training data they generate themselves than from data created by external sources? This note explores whether a learner's own restructuring process produces better learning outcomes than consuming material structured by someone else.

Note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

SEAL (Self-Adapting Language Models) equips LLMs with the ability to generate "self-edits" — natural-language instructions that specify both the training data and optimization hyperparameters for updating the model's own weights. Given new factual knowledge to incorporate, instead of finetuning directly on the source text, the model generates its own synthetic training data optimized for self-learning.
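
To make the mechanism concrete, here is a minimal sketch of a SEAL-style self-edit for knowledge incorporation. The model choice, prompt wording, and hyperparameter values are illustrative assumptions, not the paper's code; in SEAL the hyperparameters are themselves generated by the model as part of the self-edit.

```python
# A sketch of a SEAL-style self-edit, assuming Hugging Face transformers.
# Prompt wording and hyperparameter values are illustrative, not from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # any instruct model works; arbitrary choice
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def self_edit(passage: str) -> dict:
    """The model restructures new factual text into its own training data."""
    prompt = ("List the implications, restatements, and logical consequences "
              "of the following passage, one per line:\n\n" + passage)
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=512, do_sample=True)
    synthetic = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)
    return {
        "train_data": [line for line in synthetic.splitlines() if line.strip()],
        # In SEAL these are generated by the model itself; fixed here for brevity.
        "hparams": {"lr": 1e-4, "epochs": 3, "lora_rank": 16},
    }
```

An inner-loop update then finetunes a lightweight adapter (e.g. LoRA) on `train_data` with those hyperparameters; any standard SFT routine stands in for that step.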

The results are counter-intuitive: finetuning on self-generated data improves no-passage QA performance from 33.5% to 47.0%, outperforming data generated by GPT-4.1. A weaker model's self-generated data produces better learning outcomes than a stronger model's externally generated data.

The analogy to human learning is precise: students who rewrite lecture material in their own words consistently outperform students who study the original text. The restructuring process is itself the learning — it forces the learner to identify gaps, reframe concepts in familiar terms, and create connections to existing knowledge. Different learners restructure differently (visual diagrams, text summaries, mathematical formulations) because the optimal transformation depends on the learner's representational structure, not just the content.

For LLMs, this means the model's own distributional characteristics determine what data format will produce effective weight updates. A model with particular learned representations will learn more from data that aligns with those representations than from data optimized for a different model's internal structure.
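
One crude way to operationalize "alignment with the model's own representations" (our gloss, not a method from the paper) is to score candidate training texts by the loss the current model assigns them: data the model could almost have written itself sits closer to its distribution. Reusing `model` and `tok` from the sketch above:

```python
# Illustrative heuristic only: compare the per-token loss the current model
# assigns to the same facts in different formats. Lower loss suggests the
# format is closer to the model's own distribution. Not SEAL's method.
import torch

def per_token_loss(model, tok, text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # causal LM loss, averaged over tokens
    return out.loss.item()

# e.g. compare per_token_loss(model, tok, source_passage)
#      with per_token_loss(model, tok, model_authored_restatement)
```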

The method uses RL to train the self-edit capability: the downstream performance of the updated model serves as the reward signal. This means the model learns not just what to study but how to study — selecting augmentation strategies and optimization hyperparameters alongside content.
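
The outer loop has a simple shape: sample candidate self-edits, apply each one, and reinforce the ones that improve downstream performance (the paper implements this reward-filtered scheme with ReST-EM-style behavior cloning). The sketch below reuses `self_edit` from above; `evaluate` (downstream QA accuracy), `apply_self_edit` (the inner-loop weight update), and `sft` are hypothetical helpers standing in for real implementations.

```python
# Shape of the outer RL loop: sample self-edits, apply them, keep the ones
# whose inner-loop update improves downstream performance, then finetune
# the self-edit policy on the winners. Helper functions are hypothetical.
def outer_loop(model, tasks, n_samples=4, n_rounds=3):
    for _ in range(n_rounds):
        kept = []
        for task in tasks:
            base = evaluate(model, task)                # score before updating
            for _ in range(n_samples):
                edit = self_edit(task.context)          # sample a candidate self-edit
                updated = apply_self_edit(model, edit)  # inner-loop weight update
                if evaluate(updated, task) > base:      # reward = downstream gain
                    kept.append((task.context, edit))
        model = sft(model, kept)  # reinforce only the self-edits that helped
    return model
```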

On a simplified ARC-AGI subset, SEAL also outperforms both standard in-context learning and self-editing without RL training, showing that the quality of self-generated data improves with the RL-trained meta-learning capability.

Two converging methods from alignment research reinforce the self-generation principle. First, instruction backtranslation (Humpback) trains an LLM to generate instructions for unlabeled web text, then self-selects high-quality pairs through iterative curation: the model generates its own training signal and curates it. Second, MAGPIE (Can aligned LLMs generate their own training data?) shows that aligned models can generate 4 million instruction-response pairs from their pre-query template alone, outperforming human-curated datasets. Both methods demonstrate the same principle at different levels: self-generated data captures the model's own distributional preferences, producing more learnable training signal than external generation.
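
The MAGPIE trick is simple enough to sketch directly. Assuming a Llama-3-Instruct model served through transformers (the template string below follows that family's chat format; sampling settings are arbitrary):

```python
# MAGPIE-style self-generation: prompt an aligned model with only the
# pre-query chat template and let it sample a plausible user instruction.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Everything up to where a user message would begin, and nothing more.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
ids = tok(pre_query, return_tensors="pt", add_special_tokens=False).input_ids
out = model.generate(ids, max_new_tokens=128, do_sample=True, temperature=1.0)
instruction = tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

# A second pass with the full chat template wrapped around `instruction`
# yields the paired response: an instruction-response pair from no seed data.
```

Because the instruction distribution comes from the model's own alignment training, the resulting pairs are by construction in-distribution for that model.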

This extends Does training data format shape reasoning strategy more than domain? to training data generation: not just the format of training data matters, but who generates it. And it provides a constructive mechanism for Does teacher-refined data always improve student model performance? — the ideal "teacher" for data refinement is the student model itself.


Source: Self Refinement Self Consistency Feedback — SEAL: Self-Adapting Language Models (arxiv 2506.10943); enriched from Alignment

Original note title: self-generated training data outperforms externally generated data for knowledge incorporation because model-specific restructuring matches the learner's representational needs