How do implicit world models and self-reflection operationalize consequence-based learning?

This explores how two mechanisms — a model's internal sense of how its actions change what comes next (implicit world models), and its ability to look back and critique what it did (self-reflection) — turn into actual learning from consequences, and where that learning quietly fails.

The core question is how a model gets from "I did something" to "the result of doing it should change what I do next." The corpus treats the two mechanisms as halves of one loop. The world-model half is about a model treating its own outputs as actions with downstream effects; the reflection half is about converting the resulting outcomes into a signal it can actually learn from. The corpus also shows that both halves can be faked.

Start with the world-model side. A striking claim is that post-training shifts a model from passive next-token prediction into *enaction*: it begins to recognize that its outputs become its own future inputs, closing an action-perception loop that pretraining never had (see "Do models recognize their own outputs as actions shaping future inputs?"). Once a model implicitly models consequences this way, it can learn directly from them: the "early experience" paradigm shows agents using the future states their own actions produce as supervision, with no external reward, matching expert-dependent baselines with half the data (see "Can agents learn from their own actions without external rewards?"). There is an even stranger version of implicit world modeling: RL agents that, without being told to, start using the spatial environment itself as external memory, letting the world hold state for them (see "Do RL agents accidentally use environments as memory?"). In all three, the "world model" isn't a separate module; it's an emergent recognition that acting and perceiving consequences are coupled.
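The "future states as supervision" idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the one-dimensional environment, the names `toy_env_step`, `collect_own_experience`, and `to_supervised_pairs`, and the fixed policy are hypothetical and are not taken from the early-experience work itself. The only point is that the training targets come from what the agent's own actions produced, and no reward appears anywhere.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Transition:
    state: int
    action: int
    next_state: int


def toy_env_step(state: int, action: int) -> int:
    """Hypothetical environment: a position clamped to the range 0..10."""
    return max(0, min(10, state + action))


def collect_own_experience(
    policy: Callable[[int], int], start: int, horizon: int
) -> List[Transition]:
    """Roll out the agent's own policy and record what each action led to."""
    state = start
    transitions = []
    for _ in range(horizon):
        action = policy(state)
        next_state = toy_env_step(state, action)
        transitions.append(Transition(state, action, next_state))
        state = next_state
    return transitions


def to_supervised_pairs(
    transitions: List[Transition],
) -> List[Tuple[Tuple[int, int], int]]:
    """Turn transitions into (input, target) pairs.

    Input is (state, action); target is the observed next state.
    No reward signal is used: the consequence itself is the label.
    """
    return [((t.state, t.action), t.next_state) for t in transitions]


# Usage: a policy that always moves right, starting near the wall.
rollout = collect_own_experience(lambda s: 1, start=8, horizon=4)
pairs = to_supervised_pairs(rollout)
```

In this toy run the last two pairs record that moving right at position 10 leaves the agent at 10, which is exactly the kind of consequence an agent could learn from without anyone scoring the action.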

The reflection side is where consequences get *named*. Reflexion's whole trick is that an unambiguous success/failure signal lets an agent write a verbal self-diagnosis, store it as episodic memory, and improve across episodes with frozen weights; crucially, the binary signal is what stops it from rationalizing (see "Can agents learn from failure without updating their weights?"). SkillRL pushes the same insight further: successes and failures shouldn't be processed the same way. Keep wins as concrete demonstrations and abstract losses into lessons, mirroring how human experts metabolize outcomes (see "Should successful and failed episodes be processed differently?"). And post-completion learning shows the loop can be folded inward: during training, the model learns to compute its own reward and evaluate its own work, at zero inference cost (see "Can models learn to evaluate their own work during training?").
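A Reflexion-style loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Reflexion implementation: `run_with_reflection` and the toy `toy_attempt`, `toy_check`, and `toy_reflect` functions are hypothetical stand-ins for a frozen model, an environment check, and a self-diagnosis step. The sketch shows the three ingredients described above: a binary outcome from outside the agent, a verbal note written on failure, and an episodic memory that is the only thing that changes between episodes.

```python
from typing import Callable, List, Optional, Tuple


def run_with_reflection(
    attempt: Callable[[List[str]], int],
    check: Callable[[int], bool],
    reflect: Callable[[int], str],
    max_episodes: int = 3,
) -> Tuple[Optional[int], List[str], int]:
    """Retry a task, carrying verbal reflections forward as memory.

    The 'weights' (the attempt function) never change; only memory grows.
    """
    memory: List[str] = []
    for episode in range(1, max_episodes + 1):
        answer = attempt(memory)
        if check(answer):  # unambiguous pass/fail from outside the agent
            return answer, memory, episode
        memory.append(reflect(answer))  # stored verbatim, not summarized
    return None, memory, max_episodes


# Toy stand-ins (assumptions for illustration only).
def toy_attempt(memory: List[str]) -> int:
    """Guess the smallest positive integer not ruled out by past notes."""
    ruled_out = {int(note.split()[-1]) for note in memory}
    guess = 1
    while guess in ruled_out:
        guess += 1
    return guess


def toy_check(answer: int) -> bool:
    return answer == 3


def toy_reflect(answer: int) -> str:
    return f"failed: do not repeat guess {answer}"


answer, memory, episodes = run_with_reflection(toy_attempt, toy_check, toy_reflect)
```

The design choice worth noticing is that `reflect` is only called after `check` has already returned a failure, so the note cannot argue with the outcome; it can only explain it.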

The catch is that most of this only works when the consequence signal is honest and unambiguous, and a lot of self-reflection isn't. Across eight models, reflection turns out to be mostly *confirmatory theater*: reflections rarely change the initial answer, and the reasoning traces don't faithfully describe the actual reasoning (see "Can we actually trust reasoning model outputs?"). Frontier reasoning models that *sound* deeply reflective (DeepSeek-R1 and o1-preview) reach only 20-23.6% exact match on constraint-satisfaction problems that require genuine backtracking, so reflective fluency isn't reflective competence (see "Can reasoning models actually sustain long-chain reflection?"). And when a model reports on its own states, those reports usually echo training-data distributions rather than any real internal read; genuine introspection only happens when a causal chain actually links the state to the report (see "Can language models actually introspect about their own states?").

Put together, the corpus reads as a sharp design principle: consequence-based learning works when the consequence is grounded in something the model can't talk its way around (an environmental outcome, a binary pass/fail, a future state it actually produced) and degrades into self-flattering narration once the signal becomes soft or self-generated. That's also why outcome framing matters: models update asymmetrically, with an optimism bias for actions they "chose" and pessimism about the alternatives they passed up, which can quietly drive confirmation bias in a deployed agent reflecting on its own track record (see "Do language models learn differently from good versus bad outcomes?"). Implicit world models supply the consequences; reflection metabolizes them, but only as faithfully as the signal forces it to.
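One common way to formalize an asymmetric update of this kind is a value update with different learning rates for confirming and disconfirming evidence. The sketch below assumes that formalization; it is not the method of the cited note, and the function name `asymmetric_update` and the rates 0.4 and 0.1 are arbitrary illustrative choices. For a chosen action, good news counts as confirming; for an alternative that was passed up, bad news counts as confirming.

```python
def asymmetric_update(
    value: float,
    outcome: float,
    chosen: bool,
    lr_confirm: float = 0.4,     # illustrative rate for confirming evidence
    lr_disconfirm: float = 0.1,  # illustrative rate for disconfirming evidence
) -> float:
    """Update a value estimate with a choice-dependent learning rate."""
    error = outcome - value
    if chosen:
        confirming = error > 0  # the chosen action did better than expected
    else:
        confirming = error < 0  # the passed-up option did worse than expected
    lr = lr_confirm if confirming else lr_disconfirm
    return value + lr * error


# Same starting estimate, same outcomes, different framing.
chosen_good = asymmetric_update(0.5, 1.0, chosen=True)      # moves up strongly
chosen_bad = asymmetric_update(0.5, 0.0, chosen=True)       # moves down weakly
unchosen_good = asymmetric_update(0.5, 1.0, chosen=False)   # moves up weakly
unchosen_bad = asymmetric_update(0.5, 0.0, chosen=False)    # moves down strongly
```

Under these assumed rates, identical evidence moves the estimate four times as far when it flatters the agent's own choice, which is the mechanism by which a track record reviewed by its own author can drift toward confirmation.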

Sources 10 notes

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Do RL agents accidentally use environments as memory?

A mathematical proof shows that environmental artifacts reduce the information needed to represent history in RL agents. Path-following agents naturally develop memory-like behavior through standard reward optimization, satisfying situated cognition criteria without explicit memory objectives.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Do language models learn differently from good versus bad outcomes?

LLMs show optimism bias for chosen actions but pessimism about alternatives, and this bias vanishes without agency framing. Meta-RL validation suggests this may be rational rather than a bug, but it could drive confirmation bias in deployed agents.
