Can a single problem unlock reasoning through diverse critiques?
Does exposing models to many different critiques of one problem activate reasoning better than training on many different problems? This matters because it suggests data quantity is not the main constraint on reasoning activation — the kind of exposure is.
Critique Fine-Tuning (CFT) achieves reasoning activation comparable to RLVR using only a single problem. The method collects diverse model-generated solutions to that one problem, then uses a teacher LLM to generate a detailed critique of each solution. Supervised training on these solution–critique pairs — without any reinforcement learning — unlocks reasoning performance at a fraction of the computational cost (RLVR requires hundreds of GPU hours).
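The data-construction step can be sketched as follows. This is a minimal illustration, not the paper's interface: `sample_solution` and `write_critique` are hypothetical placeholders standing in for calls to the student model and the teacher LLM, respectively.

```python
def sample_solution(problem: str, seed: int) -> str:
    """Placeholder for sampling one diverse solution from the student model."""
    return f"solution attempt #{seed} for: {problem}"

def write_critique(problem: str, solution: str) -> str:
    """Placeholder for a teacher LLM's detailed critique of a candidate solution."""
    return f"critique of '{solution}': locate errors, judge correctness, explain why"

def build_cft_dataset(problem: str, n_solutions: int) -> list[dict]:
    """Hold ONE problem fixed, vary the solutions, supervise on the critiques."""
    dataset = []
    for seed in range(n_solutions):
        sol = sample_solution(problem, seed)
        dataset.append({
            # The model sees the problem plus a candidate solution...
            "input": f"Problem: {problem}\nCandidate solution: {sol}\nCritique:",
            # ...and is trained with a standard SFT loss on the critique text.
            "target": write_critique(problem, sol),
        })
    return dataset

pairs = build_cft_dataset("Compute 17 * 24.", n_solutions=4)
```

The point of the sketch is the shape of the data: one problem, many solutions, one critique per solution — plain supervised fine-tuning, no reward signal.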
The key insight is that the diversity that matters for reasoning activation is solution diversity (many approaches to one problem) rather than problem diversity (one approach to many problems). By holding the problem constant and varying the solutions, CFT isolates the critique-and-evaluation signal as the activation mechanism.
This is the most resource-efficient confirmation yet of Do base models already contain hidden reasoning ability?. The progression of evidence is striking: RL post-training (expensive), RLVR (cheaper), 1-shot RLVR (minimal data), and now CFT (minimal data AND no RL). Each step strips away another component previously thought essential, revealing that the activation signal is remarkably simple.
CFT also extends Can a single training example unlock mathematical reasoning? in an important direction. 1-shot RLVR shows one problem suffices when RL provides the training signal. CFT shows one problem suffices when critique provides the signal instead. The common denominator is not RL, not critique, not solution diversity per se — it is exposure to the distinction between correct and incorrect reasoning applied to a specific problem.
This connects to Does RL teach reasoning or teach when to use it? by providing yet another non-RL method that achieves similar activation. If RL, steering vectors, decoding changes, and now critique fine-tuning all unlock the same latent reasoning, the mechanism is clearly pre-training-determined and the elicitation method is incidental.
The relationship to Does critiquing errors teach deeper understanding than imitating correct answers? is direct: CFT operationalizes the principle that evaluating errors teaches more than imitating successes. But CFT goes further — it shows that evaluating errors on a single problem is sufficient, collapsing the data requirement to its theoretical minimum.
Source: Reasoning Architectures — Paper: "Critique Fine-Tuning on One Problem" (arXiv:2506.03295)
Original note title
critique fine-tuning on a single problem unlocks reasoning by exposing models to diverse solution critiques rather than diverse problems