Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Can models improve themselves on tasks without verifiable answers?

Most self-improvement methods require objective correctness signals, limiting them to math and code. Can models self-improve on open-ended instruction tasks where answers can't be automatically verified?

Note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

Self-Reasoning Language Models (SRLM) addresses a bottleneck in inference-time scaling: most self-improvement methods only work on tasks with verifiable answers (math, code) where correctness can be checked automatically. For general instruction-following — open-ended tasks without deterministic answers — self-improvement has been stuck: you can't reward correctness if you can't verify it.

SRLM's solution: create a small set (~1000 samples) of "reasoning catalyst" data — demonstrations of how to transform short, shallow reasoning chains into longer, more comprehensive ones using meta-reasoning skills. This isn't training on correct answers. It's training on the process of enriching reasoning: showing the model how to unfold the implicit reasoning steps that shorter responses skip.
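To make the idea concrete, here is a minimal sketch of what a single catalyst sample might contain, assuming a simple instruction / short response / enriched response layout. The field names and the example content are illustrative assumptions, not the paper's actual schema:

```python
# Hypothetical shape of one "reasoning catalyst" sample (illustrative, not the paper's format).
# The point is the pairing: a shallow answer next to a version that unfolds the implicit steps.
catalyst_sample = {
    "instruction": "Explain why the sky appears blue.",
    "short_response": (
        "Sunlight scatters off air molecules, and blue light scatters the most, "
        "so the sky looks blue."
    ),
    # The enriched response demonstrates the meta-reasoning move: restate the question,
    # name the mechanism, walk through the wavelength dependence, handle an edge case.
    "enriched_response": (
        "Restate the question: why does the daytime sky look blue rather than white? "
        "Sunlight contains all visible wavelengths. Passing through the atmosphere, it "
        "scatters off molecules much smaller than its wavelength (Rayleigh scattering), "
        "and scattering strength rises sharply as wavelength shrinks, so blue scatters far "
        "more than red. Violet scatters even more, but the sun emits less of it and our "
        "eyes are less sensitive to it, so the perceived color is blue. At sunset the path "
        "through air is longer, more blue is scattered away, and the direct light looks red."
    ),
}
```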

After training on this catalyst data alongside the original instruction-tuning dataset, the model acquires two capabilities: (a) the base task competence from instruction tuning, and (b) the ability to enrich its own reasoning. The model can then iteratively improve by generating enriched reasoning candidates for training examples, filtering with three quality selectors (no assumption about instruction type or answer format), and retraining on the improved data.
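A rough sketch of that loop, assuming a generic `model.generate` API, an injected `finetune` stand-in, and three placeholder selectors; the paper's actual selectors and training code are not reproduced here:

```python
from typing import Callable, List, Tuple

# --- Illustrative stand-ins (assumptions, not the paper's components) ---

def generate_enriched(model, instruction: str) -> str:
    """Ask the model to answer with longer, more explicit reasoning."""
    prompt = f"{instruction}\n\nAnswer with detailed step-by-step reasoning."
    return model.generate(prompt)  # assumes a .generate(str) -> str API

def quality_selectors() -> List[Callable[[str, str], bool]]:
    """Format-agnostic filters: no assumption about instruction type or answer shape."""
    def long_enough(instruction: str, response: str) -> bool:
        return len(response.split()) >= 50
    def not_repetitive(instruction: str, response: str) -> bool:
        words = response.split()
        return len(set(words)) / max(len(words), 1) > 0.3
    def on_topic(instruction: str, response: str) -> bool:
        return bool(set(instruction.lower().split()) & set(response.lower().split()))
    return [long_enough, not_repetitive, on_topic]

# --- The iterative self-improvement loop described above ---

def self_improve(model, finetune, instructions: List[str],
                 catalyst_data: List[Tuple[str, str]],
                 rounds: int = 3, k: int = 4):
    selectors = quality_selectors()
    for _ in range(rounds):
        improved: List[Tuple[str, str]] = []
        for inst in instructions:
            # Sample several enriched-reasoning candidates per training example.
            candidates = [generate_enriched(model, inst) for _ in range(k)]
            # Keep only candidates that pass every selector.
            kept = [c for c in candidates if all(sel(inst, c) for sel in selectors)]
            if kept:
                # Pick the longest survivor as a crude proxy for reasoning richness.
                improved.append((inst, max(kept, key=len)))
        # Retrain on the improved data plus the catalyst data, so the enrichment
        # skill itself is reinforced each round rather than washed out.
        model = finetune(model, catalyst_data + improved)
    return model
```

The design choice to carry the catalyst data into every retraining round is what the stability claim below rests on: the enrichment signal is re-taught each iteration instead of being diluted by the model's own outputs.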

The key finding is stability: SRLM "not only enhances the model's initial performance but also ensures more stable and consistent improvements in subsequent iterations." This contrasts with prior methods that degrade or plateau after a few iterations — the reasoning catalyst provides a persistent enrichment signal that doesn't exhaust.

The 1000-sample requirement is remarkably small — though Can a single training example unlock mathematical reasoning? pushes this even further for narrow domains. The difference may be task breadth: a single example activates math reasoning specifically, while catalyst data enables general instruction-following self-improvement. Both connect to Do base models already contain hidden reasoning ability? — the reasoning capability is latent in the pretrained model; the catalyst data doesn't teach reasoning but unlocks the ability to articulate it. Similarly, Can small models reason well by just learning output format? shows format is often the bottleneck, not capability.

The implication for the inference-scaling agenda: self-improvement at test time is not limited to domains with external verifiers. With the right catalyst data, models can improve on any task where more detailed reasoning would help — which is most tasks.


Source: Self Refinement Self Consistency Feedback — SRLM (arXiv:2505.14116)


Reasoning catalyst data — just 1000 demonstrations of how to enrich reasoning — enables self-improvement for general instruction tasks beyond math and code.