Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Can models improve themselves on tasks without verifiable answers?

Most self-improvement methods require objective correctness signals, limiting them to math and code. Can models self-improve on open-ended instruction tasks where answers can't be automatically verified?

Note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

Self-Reasoning Language Models (SRLM) addresses a bottleneck in inference-time scaling: most self-improvement methods only work on tasks with verifiable answers (math, code) where correctness can be checked automatically. For general instruction-following — open-ended tasks without deterministic answers — self-improvement has been stuck: you can't reward correctness if you can't verify it.

SRLM's solution: create a small set (~1000 samples) of "reasoning catalyst" data — demonstrations of how to transform short, shallow reasoning chains into longer, more comprehensive ones using meta-reasoning skills. This isn't training on correct answers. It's training on the process of enriching reasoning: showing the model how to unfold the implicit reasoning steps that shorter responses skip.
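To make the idea concrete, here is a minimal sketch of what a single catalyst sample might contain, assuming a simple instruction / short response / enriched response layout. The field names and the example content are illustrative assumptions, not the paper's actual schema:

```python
# Hypothetical shape of one "reasoning catalyst" sample (illustrative, not the paper's format).
# The point is the pairing: a shallow answer next to a version that unfolds the implicit steps.
catalyst_sample = {
    "instruction": "Explain why the sky appears blue.",
    "short_response": (
        "Sunlight scatters off air molecules, and blue light scatters the most, "
        "so the sky looks blue."
    ),
    # The enriched response demonstrates the meta-reasoning move: restate the question,
    # name the mechanism, walk through the wavelength dependence, handle an edge case.
    "enriched_response": (
        "Restate the question: why does the daytime sky look blue rather than white? "
        "Sunlight contains all visible wavelengths. Passing through the atmosphere, it "
        "scatters off molecules much smaller than its wavelength (Rayleigh scattering), "
        "and scattering strength rises sharply as wavelength shrinks, so blue scatters far "
        "more than red. Violet scatters even more, but the sun emits less of it and our "
        "eyes are less sensitive to it, so the perceived color is blue. At sunset the path "
        "through air is longer, more blue is scattered away, and the direct light looks red."
    ),
}
```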

After training on this catalyst data alongside the original instruction-tuning dataset, the model acquires two capabilities: (a) the base task competence from instruction tuning, and (b) the ability to enrich its own reasoning. The model can then iteratively improve by generating enriched reasoning candidates for training examples, filtering with three quality selectors (no assumption about instruction type or answer format), and retraining on the improved data.
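A rough sketch of that loop, assuming a generic `model.generate` API, an injected `finetune` stand-in, and three placeholder selectors; the paper's actual selectors and training code are not reproduced here:

```python
from typing import Callable, List, Tuple

# --- Illustrative stand-ins (assumptions, not the paper's components) ---

def generate_enriched(model, instruction: str) -> str:
    """Ask the model to answer with longer, more explicit reasoning."""
    prompt = f"{instruction}\n\nAnswer with detailed step-by-step reasoning."
    return model.generate(prompt)  # assumes a .generate(str) -> str API

def quality_selectors() -> List[Callable[[str, str], bool]]:
    """Format-agnostic filters: no assumption about instruction type or answer shape."""
    def long_enough(instruction: str, response: str) -> bool:
        return len(response.split()) >= 50
    def not_repetitive(instruction: str, response: str) -> bool:
        words = response.split()
        return len(set(words)) / max(len(words), 1) > 0.3
    def on_topic(instruction: str, response: str) -> bool:
        return bool(set(instruction.lower().split()) & set(response.lower().split()))
    return [long_enough, not_repetitive, on_topic]

# --- The iterative self-improvement loop described above ---

def self_improve(model, finetune, instructions: List[str],
                 catalyst_data: List[Tuple[str, str]],
                 rounds: int = 3, k: int = 4):
    selectors = quality_selectors()
    for _ in range(rounds):
        improved: List[Tuple[str, str]] = []
        for inst in instructions:
            # Sample several enriched-reasoning candidates per training example.
            candidates = [generate_enriched(model, inst) for _ in range(k)]
            # Keep only candidates that pass every selector.
            kept = [c for c in candidates if all(sel(inst, c) for sel in selectors)]
            if kept:
                # Pick the longest survivor as a crude proxy for reasoning richness.
                improved.append((inst, max(kept, key=len)))
        # Retrain on the improved data plus the catalyst data, so the enrichment
        # skill itself is reinforced each round rather than washed out.
        model = finetune(model, catalyst_data + improved)
    return model
```

The design choice to carry the catalyst data into every retraining round is what the stability claim below rests on: the enrichment signal is re-taught each iteration instead of being diluted by the model's own outputs.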

The key finding is stability: SRLM "not only enhances the model's initial performance but also ensures more stable and consistent improvements in subsequent iterations." This contrasts with prior methods that degrade or plateau after a few iterations — the reasoning catalyst provides a persistent enrichment signal that doesn't exhaust.

The 1000-sample requirement is remarkably small — though Can a single training example unlock mathematical reasoning? pushes this even further for narrow domains. The difference may be task breadth: a single example activates math reasoning specifically, while catalyst data enables general instruction-following self-improvement. Both connect to Do base models already contain hidden reasoning ability? — the reasoning capability is latent in the pretrained model; the catalyst data doesn't teach reasoning but unlocks the ability to articulate it. Similarly, Can small models reason well by just learning output format? shows format is often the bottleneck, not capability.

The implication for the inference-scaling agenda: self-improvement at test time is not limited to domains with external verifiers. With the right catalyst data, models can improve on any task where more detailed reasoning would help — which is most tasks.


Source: Self Refinement Self Consistency Feedback — SRLM (arXiv:2505.14116)


Reasoning catalyst data — just 1000 demonstrations of how to enrich reasoning — enables self-improvement for general instruction tasks beyond math and code.