LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Can models learn reasoning from predicting text alone?

Can language models bootstrap general reasoning abilities by generating explanations at every token position during pretraining, without task-specific supervision? This note explores whether reasoning emerges naturally from optimizing predictive accuracy alone.

Note · 2026-02-22 · sourced from Reasoning by Reflection
How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

STaR showed that LMs can bootstrap reasoning by training on rationales that led to correct answers on curated QA datasets. Quiet-STaR generalizes this in one critical way: rather than generating a rationale per problem, it generates a rationale at every token position to explain future text. The training corpus is arbitrary internet text, not curated reasoning tasks.

The mechanism: at each token, the model generates a thought, mixes the thought-conditioned next-token prediction with the raw next-token prediction via a learned mixing head, and uses REINFORCE to improve thought quality. Custom meta-tokens mark thought boundaries, letting the model learn when to generate a rationale and when to end it and commit to a prediction.
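A minimal sketch of that data flow at a single token position. All names and values here are hypothetical stand-ins: in the real method the two predictions come from the same transformer run with and without an inserted thought, and the mixing weight comes from a trained head; here random logits just illustrate how the pieces combine.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size (illustrative only)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-ins for the two prediction heads at one token position (hypothetical).
base_logits = rng.normal(size=VOCAB)      # next-token prediction without a thought
thought_logits = rng.normal(size=VOCAB)   # prediction after a sampled thought

# Learned mixing head: a scalar weight w in [0, 1] deciding how much the
# thought-conditioned prediction influences the final distribution.
head_output = 0.3                         # hypothetical raw head output
w = 1 / (1 + np.exp(-head_output))        # sigmoid -> mixing weight

p_base = softmax(base_logits)
p_thought = softmax(thought_logits)
p_mixed = w * p_thought + (1 - w) * p_base  # interpolate in probability space

next_token = 7  # the actual next token observed in the corpus

# REINFORCE reward signal: how much did the thought improve the
# log-likelihood of the true continuation over the no-thought baseline?
reward = np.log(p_mixed[next_token]) - np.log(p_base[next_token])
# The log-probs of the sampled thought tokens are then scaled by this reward
# to form the policy-gradient update (not shown).
```

Because `p_mixed` is a convex combination of two distributions, it is itself a valid distribution, so the reward is a well-defined log-likelihood difference.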

The key shift: from task-specific reasoning ("do this type of math problem") to text-general reasoning ("what reasoning helps predict what comes next in any text?"). STaR's ceiling was its dependency on curated QA datasets — high-quality, but inherently narrow. Quiet-STaR's ceiling is the diversity of the pretraining corpus.

Because rationale quality is judged by predictive accuracy on future text rather than correctness on labeled answers, the method generalizes across the tasks present in language rather than the tasks present in annotation pipelines. The "task" is prediction itself.
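To make "the task is prediction itself" concrete: a thought can be scored against the next few true tokens of ordinary text, with no labeled answer anywhere. A toy sketch under assumed names (the horizon length, vocabulary size, and distributions are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
HORIZON, VOCAB = 4, 50  # score a thought against the next few true tokens (toy sizes)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def log_likelihood(probs_per_step, tokens):
    # Sum of log-probabilities the model assigns to the true future tokens.
    return sum(np.log(p[t]) for p, t in zip(probs_per_step, tokens))

# Hypothetical per-step distributions over upcoming tokens,
# with and without a generated thought.
p_with = [softmax(rng.normal(size=VOCAB)) for _ in range(HORIZON)]
p_without = [softmax(rng.normal(size=VOCAB)) for _ in range(HORIZON)]
true_tokens = rng.integers(0, VOCAB, size=HORIZON)

# The "label" is just the text itself: the reward is the improvement in
# predictive log-likelihood relative to the no-thought baseline.
reward = log_likelihood(p_with, true_tokens) - log_likelihood(p_without, true_tokens)
```

Since the baseline is subtracted, a thought is only rewarded insofar as it beats what the model already predicted, which is what pushes rationales toward genuinely useful reasoning rather than restating the obvious.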

This remains constrained by training distribution: rationales that help predict common internet text patterns may not generalize to hard reasoning requiring novel inference that rarely appears in the corpus. But it suggests that general reasoning competence may be trainable as a side effect of improved language modeling, rather than as a separate supervised objective.



Quiet-STaR learns rationale generation at the token level, not the task level, enabling general reasoning without task-specific supervision.