LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Can models learn reasoning from predicting text alone?

Can language models bootstrap general reasoning abilities by generating explanations at every token position during pretraining, without task-specific supervision? This note explores whether reasoning emerges naturally from optimizing predictive accuracy alone.

Note · 2026-02-22 · sourced from Reasoning by Reflection
How should we allocate compute budget at inference time? How should researchers navigate LLM reasoning research?

STaR showed that LMs can bootstrap reasoning by training on rationales that led to correct answers on curated QA datasets. Quiet-STaR generalizes this in one critical way: rather than generating a rationale per problem, it generates a rationale at every token position to explain future text. The training corpus is arbitrary internet text, not curated reasoning tasks.

The mechanism: at each token, the model generates a thought, mixes the thought-conditioned next-token prediction with the raw next-token prediction via a learned mixing head, and uses REINFORCE to improve thought quality. Custom meta-tokens mark thought boundaries, letting the model learn when to generate a rationale and when to end it and commit to a prediction.
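A minimal sketch of that data flow at a single token position. All names and values here are hypothetical stand-ins: in the real method the two predictions come from the same transformer run with and without an inserted thought, and the mixing weight comes from a trained head; here random logits just illustrate how the pieces combine.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size (illustrative only)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-ins for the two prediction heads at one token position (hypothetical).
base_logits = rng.normal(size=VOCAB)      # next-token prediction without a thought
thought_logits = rng.normal(size=VOCAB)   # prediction after a sampled thought

# Learned mixing head: a scalar weight w in [0, 1] deciding how much the
# thought-conditioned prediction influences the final distribution.
head_output = 0.3                         # hypothetical raw head output
w = 1 / (1 + np.exp(-head_output))        # sigmoid -> mixing weight

p_base = softmax(base_logits)
p_thought = softmax(thought_logits)
p_mixed = w * p_thought + (1 - w) * p_base  # interpolate in probability space

next_token = 7  # the actual next token observed in the corpus

# REINFORCE reward signal: how much did the thought improve the
# log-likelihood of the true continuation over the no-thought baseline?
reward = np.log(p_mixed[next_token]) - np.log(p_base[next_token])
# The log-probs of the sampled thought tokens are then scaled by this reward
# to form the policy-gradient update (not shown).
```

Because `p_mixed` is a convex combination of two distributions, it is itself a valid distribution, so the reward is a well-defined log-likelihood difference.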

The key shift: from task-specific reasoning ("do this type of math problem") to text-general reasoning ("what reasoning helps predict what comes next in any text?"). STaR's ceiling was its dependency on curated QA datasets — high-quality, but inherently narrow. Quiet-STaR's ceiling is the diversity of the pretraining corpus.

Because rationale quality is judged by predictive accuracy on future text rather than correctness on labeled answers, the method generalizes across the tasks present in language rather than the tasks present in annotation pipelines. The "task" is prediction itself.
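To make "the task is prediction itself" concrete: a thought can be scored against the next few true tokens of ordinary text, with no labeled answer anywhere. A toy sketch under assumed names (the horizon length, vocabulary size, and distributions are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
HORIZON, VOCAB = 4, 50  # score a thought against the next few true tokens (toy sizes)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def log_likelihood(probs_per_step, tokens):
    # Sum of log-probabilities the model assigns to the true future tokens.
    return sum(np.log(p[t]) for p, t in zip(probs_per_step, tokens))

# Hypothetical per-step distributions over upcoming tokens,
# with and without a generated thought.
p_with = [softmax(rng.normal(size=VOCAB)) for _ in range(HORIZON)]
p_without = [softmax(rng.normal(size=VOCAB)) for _ in range(HORIZON)]
true_tokens = rng.integers(0, VOCAB, size=HORIZON)

# The "label" is just the text itself: the reward is the improvement in
# predictive log-likelihood relative to the no-thought baseline.
reward = log_likelihood(p_with, true_tokens) - log_likelihood(p_without, true_tokens)
```

Since the baseline is subtracted, a thought is only rewarded insofar as it beats what the model already predicted, which is what pushes rationales toward genuinely useful reasoning rather than restating the obvious.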

This remains constrained by training distribution: rationales that help predict common internet text patterns may not generalize to hard reasoning requiring novel inference that rarely appears in the corpus. But it suggests that general reasoning competence may be trainable as a side effect of improved language modeling, rather than as a separate supervised objective.



Quiet-STaR learns rationale generation at the token level, not the task level, enabling general reasoning without task-specific supervision.