Can models learn better by training on messy exploration paths?
Does including trial-and-error, reflection, and backtracking in training data teach models to reason more robustly than teaching only the polished shortest path to answers?
Responding to the opacity of OpenAI's o1, this real-time replication effort contributes a paradigm beyond the engineering: journey learning. Where standard training teaches a model the shortcut (the clean path from problem to correct answer), journey learning encourages models to learn the complete exploration process: trial and error, reflection, and backtracking. The bet is that o1-style deep reasoning comes from internalizing how to search, including dead ends and recoveries, not from memorizing polished solution traces. The paper also models a methodological stance: transparent, continuously documented, community-engaged research that reports failures as well as successes.
The key takeaway is the training-data philosophy: include the messy trajectory (failed attempts, self-correction) in the supervision signal, on the premise that this is what teaches robust reasoning, whereas shortcut-only data teaches confident but brittle answers.
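To make the contrast concrete, here is a minimal sketch of how one exploration tree could be flattened into either kind of training trace. It is an illustration under stated assumptions, not the paper's pipeline: the `Step` tree format, the `shortcut_trace` and `journey_trace` helpers, and the fixed reflection phrase are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical reflection phrase inserted whenever the search abandons a branch.
REFLECTION = "Wait, this does not work. Going back."


@dataclass
class Step:
    """One node in a hypothetical exploration tree."""
    text: str
    correct: bool  # True if this step lies on a path to the right answer
    children: list[Step] = field(default_factory=list)


def shortcut_trace(node: Step) -> list[str]:
    """Keep only the steps on the successful path (shortcut-style data)."""
    if not node.correct:
        return []
    for child in node.children:
        sub = shortcut_trace(child)
        if sub:
            return [node.text] + sub
    return [node.text]


def journey_trace(node: Step) -> list[str]:
    """Depth-first walk that keeps dead ends and marks each backtrack."""
    trace = [node.text]
    if not node.correct:
        trace.append(REFLECTION)  # record the failure instead of dropping it
        return trace
    for child in node.children:
        trace += journey_trace(child)
        if child.correct:
            break  # stop exploring siblings once a branch succeeds
    return trace


# Toy problem: one failed guess, then the successful derivation.
root = Step("Solve 2x + 3 = 11", True, [
    Step("Guess x = 3: 2*3 + 3 = 9, not 11", False),
    Step("Subtract 3 from both sides: 2x = 8", True, [
        Step("Divide by 2: x = 4", True),
    ]),
])

print("\n".join(journey_trace(root)))
```

On this toy tree the shortcut trace has three steps, while the journey trace has five: it also keeps the failed guess and the reflection line that follows it. Those two extra lines are the part of the supervision signal that shortcut-only data discards.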
This sits in the vault's reasoning-training thread. It is the constructive counter to the finding discussed in "Is reflection in reasoning models actually fixing mistakes?": journey learning tries to make exploration genuine rather than performative. It also pairs with "When does RL actually extend reasoning beyond pretraining?", since both concern what reasoning data actually teaches the model.
Related concepts in this collection 3
- Is reflection in reasoning models actually fixing mistakes?
  Do the thinking steps that appear after a model's first answer represent genuine self-correction, or are they mostly confirming what the model already concluded? Understanding this matters for how we train and deploy reasoning systems.
  journey learning aims to make exploration/reflection genuine rather than confirmatory theater
- When does RL actually extend reasoning beyond pretraining?
  Does reinforcement learning genuinely expand a model's reasoning capabilities, or does it merely improve sampling from existing knowledge? This question hinges on whether pretraining provides sufficient foundation and whether RL targets tasks within reach.
  both concern what reasoning training data actually teaches
- Can experiment failures drive progress instead of stopping it?
  Explores whether autonomous research systems can treat failed runs as information rather than termination signals. This matters because real science is iterative, and systems that halt on errors cannot learn from failure.
  failures-as-information at the training-data level rather than the agent-execution level
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- Teaching Large Language Models to Reason with Reinforcement Learning
- Reasoning Language Models: A Blueprint
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models
- Large Language Models Think Too Fast To Explore Effectively
Original note title
journey learning trains models on the complete exploration process — trial error reflection and backtracking — not just shortcut solutions