Can models learn better by training on messy exploration paths?

Does including trial-and-error, reflection, and backtracking in training data teach models to reason more robustly than teaching only the polished shortest path to answers?

Synthesis note · 2026-06-03 · sourced from Deep Research

Responding to the opacity of OpenAI's o1, this real-time replication effort contributes a paradigm beyond the engineering: journey learning. Standard training teaches a model the shortcut, the clean path from problem to correct answer. Journey learning instead encourages models to learn the complete exploration process: trial and error, reflection, and backtracking. The bet is that o1-style deep reasoning comes from internalizing how to search (including dead ends and recoveries), not from memorizing polished solution traces. The paper also models a methodological stance: transparent, continuously documented, community-engaged research that reports failures as well as successes.

The keeper is the training-data philosophy: include the messy trajectory (failed attempts, self-correction) in the supervision signal, on the premise that this is what teaches robust reasoning, whereas shortcut-only data teaches confident but brittle answers.
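To make the contrast concrete, here is a minimal sketch of how one recorded solution attempt could be serialized into either kind of training target. Everything in it is an assumption made for illustration: the `Step` fields, the `shortcut_target` and `journey_target` helpers, and the `Reflection:` / `Backtrack:` markers are hypothetical and are not taken from the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One reasoning step from a recorded attempt (hypothetical schema)."""
    text: str              # the step as the solver wrote it
    on_final_path: bool    # True if the step survives into the final solution
    reflection: str = ""   # why the step was abandoned, if it was a dead end


def shortcut_target(steps: List[Step], answer: str) -> str:
    """Shortcut-style target: keep only the clean path to the answer."""
    kept = [s.text for s in steps if s.on_final_path]
    return "\n".join(kept + [f"Answer: {answer}"])


def journey_target(steps: List[Step], answer: str) -> str:
    """Journey-style target: keep every step in order, including dead ends."""
    lines = []
    for s in steps:
        lines.append(s.text)
        if not s.on_final_path:
            # Dead end: record the reflection, then mark the backtrack.
            lines.append(f"Reflection: {s.reflection}")
            lines.append("Backtrack: return to the last valid step.")
    return "\n".join(lines + [f"Answer: {answer}"])


# A toy trace with one failed attempt and its recovery.
trace = [
    Step("Set up the equation: 2x + 3 = 11.", True),
    Step("Subtract 3 from the left side only: 2x = 11.", False,
         "Both sides must change together; the right side was left at 11."),
    Step("Subtract 3 from both sides: 2x = 8.", True),
    Step("Divide both sides by 2: x = 4.", True),
]

shortcut = shortcut_target(trace, "4")
journey = journey_target(trace, "4")
```

Both targets end in the same answer, but only `journey` retains the failed step, the reason it failed, and the recovery. Under the journey-learning bet, those retained lines are the part of the supervision signal that teaches how to search.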

This sits in the vault's reasoning-training thread. It is the constructive counter to the finding in "Is reflection in reasoning models actually fixing mistakes?": journey learning tries to make exploration genuine rather than performative. It also pairs with "When does RL actually extend reasoning beyond pretraining?", since both concern what reasoning data actually teaches the model.

Related concepts in this collection 3

Is reflection in reasoning models actually fixing mistakes? Do the thinking steps that appear after a model's first answer represent genuine self-correction, or are they mostly confirming what the model already concluded? Understanding this matters for how we train and deploy reasoning systems.
journey learning aims to make exploration/reflection genuine rather than confirmatory theater
When does RL actually extend reasoning beyond pretraining? Does reinforcement learning genuinely expand a model's reasoning capabilities, or does it merely improve sampling from existing knowledge? This question hinges on whether pretraining provides sufficient foundation and whether RL targets tasks within reach.
both concern what reasoning training data actually teaches
Can experiment failures drive progress instead of stopping it? Explores whether autonomous research systems can treat failed runs as information rather than termination signals. This matters because real science is iterative, and systems that halt on errors cannot learn from failure.
failures-as-information at the training-data level rather than the agent-execution level
