How does trajectory burstiness compare to other structural properties that shape emergent capabilities?
This explores whether 'trajectory burstiness' — the idea that capability gains come in uneven, structurally-driven jumps along a model's path — is a real shaper of emergent ability, set against other structural properties the corpus identifies (training phases, entropy dynamics, trajectory topology, and measurement choices).
Whether the burstiness of a model's trajectory really shapes what capabilities emerge depends on a prior question, and the corpus's most provocative move is to ask it first: is the "emergence" being shaped real at all? The headline claim is that sharp, sudden capability jumps are largely an artifact of how we measure: switch from pass/fail metrics to continuous ones and the bursts smooth into predictable gains. The same skepticism shows up in a sibling finding: the famous exploration-exploitation trade-off looks fundamental at the token level but largely disappears when you measure hidden states instead. So before treating burstiness as a structural property, the corpus warns that some of what looks bursty is an artifact of the ruler, not the territory.
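The measurement argument can be illustrated with a toy calculation. The sketch below is not taken from the source notes: it assumes a made-up, perfectly linear rise in per-token accuracy and assumes token errors are independent, so that an all-or-nothing exact-match score over `ANSWER_LEN` tokens is simply the per-token accuracy raised to that power. All numbers and names are hypothetical.

```python
import numpy as np

# Hypothetical smooth trend: per-token accuracy rises linearly with scale.
scales = np.arange(1, 11)  # stand-in for successive model sizes
per_token_acc = np.linspace(0.50, 0.95, len(scales))

ANSWER_LEN = 20  # assumed number of tokens that must all be correct

# Continuous metric: expected fraction of correct tokens.
continuous = per_token_acc

# Discontinuous metric: exact match needs every token right, so under the
# independence assumption it equals p ** ANSWER_LEN.
exact_match = per_token_acc ** ANSWER_LEN

for s, c, e in zip(scales, continuous, exact_match):
    print(f"scale={s:2d}  per-token={c:.2f}  exact-match={e:.4f}")
```

Under these assumptions the continuous column climbs by the same amount at every step, while the exact-match column sits near zero for most of the range and then rises steeply. The underlying behavior is identical in both columns; only the metric differs.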
What survives that scrutiny is structure in *time*. RL training reliably moves through a predictable two-phase learning sequence: first execution correctness gets consolidated, then strategic planning becomes the bottleneck, with planning-token entropy rising while execution entropy stabilizes. This is the real version of "burstiness": capability doesn't accumulate evenly; instead, which sub-skill is doing the learning shifts over time. A related view argues that RL post-training deploys existing reasoning rather than creating new reasoning: base models already hold the strategies latently, and training mostly optimizes *when* to deploy them, which is why gains can appear abruptly even though nothing new was built.
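To make "planning-token entropy" versus "execution entropy" concrete, here is a minimal sketch of how one might compute the two quantities from next-token logits. It assumes a boolean mask marking which positions count as planning tokens; how the source work actually labels tokens is not specified here, so `is_planning` and the helper names are hypothetical.

```python
import numpy as np

def token_entropy(logits: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of the next-token distribution per position.

    `logits` has shape (positions, vocab_size).
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)

def mean_entropy_by_role(logits: np.ndarray, is_planning) -> dict:
    """Average entropy separately over planning and execution positions.

    `is_planning` is a hypothetical per-position label, assumed to be given.
    """
    ent = token_entropy(logits)
    mask = np.asarray(is_planning, dtype=bool)
    return {
        "planning": float(ent[mask].mean()),
        "execution": float(ent[~mask].mean()),
    }

# Toy input: one uncertain (uniform) position and one confident position.
toy_logits = np.array([[0.0, 0.0, 0.0, 0.0],
                       [10.0, 0.0, 0.0, 0.0]])
result = mean_entropy_by_role(toy_logits, is_planning=[True, False])
print(result)
```

Tracking these two averages across training checkpoints is one way the two-phase pattern could be observed, assuming the token labels are reliable.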
A second structural axis is entropy and diversity, which behaves like a budget that bursts can spend down. Training order mechanically reshapes it: structured domains lower output entropy while creative ones raise it, so scheduling structured tasks first protects open-ended capability from collapse. The cost of ignoring this shows up in search agents, where RL squeezes exploration diversity through the same entropy-collapse mechanism seen in reasoning: policies converge onto narrow reward-maximizing paths and lose breadth. So a property adjacent to burstiness is *fragility*: uneven optimization can crater the diversity that future capability depends on.
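One simple way to put a number on "losing breadth" is the entropy of the empirical distribution over distinct trajectories an agent produces. The sketch below is an illustrative proxy, not the measure used in the source notes; the action names and the two sample sets are invented.

```python
from collections import Counter
import math

def strategy_entropy(trajectories) -> float:
    """Shannon entropy (bits) of the empirical distribution over distinct trajectories."""
    counts = Counter(tuple(t) for t in trajectories)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical samples from a search agent before and after RL training.
before = [
    ["search", "read", "answer"],
    ["browse", "search", "answer"],
    ["search", "refine", "search", "answer"],
    ["read", "answer"],
]
after = [["search", "read", "answer"]] * 3 + [["read", "answer"]]

h_before = strategy_entropy(before)
h_after = strategy_entropy(after)
print(f"before: {h_before:.3f} bits, after: {h_after:.3f} bits")
```

In this toy case the four distinct strategies before training give 2 bits, and the collapsed set gives well under 1 bit. Exact-sequence matching is a crude notion of "strategy"; a real analysis would likely need a coarser grouping.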
The corpus also offers a more literal reading: that the *shape of the trajectory itself* carries usable structure. Process supervision can be derived from structural features of the trajectory, like tree topology, expert-aligned actions, or tool-call positions, turning sparse outcome rewards into dense step signals without annotated reward models. And trajectories double as memory: RL agents end up using their environments as memory, while a broader theory recasts cognition as reused from stored inference paths rather than recomputed from scratch. Here the trajectory isn't just where capability emerges; it's the substrate.
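As a rough illustration of how tree topology alone can yield step-level signals, the sketch below scores each branch by how its subtree's mean outcome reward compares with the mean across its siblings. This is a simplified sibling-baseline scheme written for illustration; it is not the published Tree-GRPO algorithm, and the `Node` type and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    name: str
    reward: Optional[float] = None  # outcome reward, set on leaves only
    children: List["Node"] = field(default_factory=list)

def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of all leaves under `node`."""
    if not node.children:
        return [node.reward]
    out: List[float] = []
    for child in node.children:
        out.extend(leaf_rewards(child))
    return out

def subtree_value(node: Node) -> float:
    """Mean outcome reward over the leaves under `node`."""
    rewards = leaf_rewards(node)
    return sum(rewards) / len(rewards)

def step_signals(node: Node, signals: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Score each step as its subtree value minus the mean value of its siblings."""
    if signals is None:
        signals = {}
    if node.children:
        values = [subtree_value(c) for c in node.children]
        baseline = sum(values) / len(values)
        for child, value in zip(node.children, values):
            signals[child.name] = value - baseline
            step_signals(child, signals)
    return signals

# Toy rollout tree: branch "a" always succeeds, branch "b" succeeds half the time.
tree = Node("root", children=[
    Node("a", children=[Node("a1", reward=1.0), Node("a2", reward=1.0)]),
    Node("b", children=[Node("b1", reward=0.0), Node("b2", reward=1.0)]),
])
signals = step_signals(tree)
print(signals)
```

Only leaf outcomes are supplied, yet every intermediate step receives a signal, which is the sense in which structure substitutes for an annotated process reward model.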
The thread that ties burstiness to all of these is that emergence is path-dependent and self-referential. Post-training flips a model from passive prediction into active participation: it begins treating its own outputs as actions that shape its next inputs, closing a feedback loop. That loop is also where things go wrong: pure self-improvement stalls without an external anchor, because the trajectory feeds on itself. So the honest comparison is this: burstiness is best understood not as a standalone property but as one symptom of trajectory-level dynamics (phase shifts, entropy budgets, feedback loops), some of which are real structure and some of which dissolve under a better metric.
Sources (11 notes)
Sharp, unpredictable capability transitions vanish when using continuous metrics instead of discontinuous ones. The same model outputs show smooth predictable improvement with scale, suggesting emergence is a measurement choice rather than a real behavioral change.
Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.
Mathematical proof shows that environmental artifacts reduce information needed to represent history in RL agents. Path-following agents naturally develop memory-like behavior through standard reward optimization, satisfying situated cognition criteria without explicit memory objectives.
Memory-Amortized Inference proposes intelligence arises from structured reuse of prior inference paths over topological memory, inverting RL's reward-forward logic into cause-backward reconstruction. This duality explains energy efficiency and suggests memory trajectories form the substrate of adaptive thought.
Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.