What makes the embers of autoregression framework predictive?
This explores why thinking about LLM behavior as the lingering imprint of next-token (autoregressive) training — the 'embers' left by that objective — lets you predict where models will succeed and fail, rather than just describing it after the fact.
The claim behind the 'embers of autoregression' lens is that a model's failures aren't random quirks but residue of being trained to generate one token at a time, left-to-right, by probability. If that's true, you should be able to point at the objective and forecast the failure before you run the model, and the corpus has several sharp cases where exactly that works.
The cleanest demonstration is constraint satisfaction. The reason models hit a ceiling on these problems isn't that they're undertrained; it's that token-by-token generation can't *retract* a token it has already emitted, while constraint solvers fundamentally depend on discarding bad partial guesses (Why does autoregressive generation fail at constraint satisfaction?). That's a prediction you can make purely from the generation mechanism, no benchmark required: any task that needs backtracking should hit the same ceiling, and bolting on a symbolic solver fixes it precisely because the solver supplies what the architecture structurally lacks. The 'ember' framework is predictive here because it locates the limit in the *shape of generation*, not the *quality of the model*.
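To make the retraction point concrete, here is a minimal sketch on 4-queens. It is a toy, not a model of an LLM: `commit_only` is a hypothetical stand-in for emit-only decoding (it takes the first locally valid column for each row and can never undo it), while `backtracking` is an ordinary depth-first solver. All function names are illustrative assumptions.

```python
from typing import List, Optional


def is_safe(cols: List[int], col: int) -> bool:
    """True if a queen at (len(cols), col) attacks none of the queens already in `cols`."""
    row = len(cols)
    return all(c != col and abs(c - col) != row - r for r, c in enumerate(cols))


def commit_only(n: int) -> Optional[List[int]]:
    """Row-by-row placement that never undoes a choice (toy stand-in for emit-only decoding)."""
    cols: List[int] = []
    for _ in range(n):
        choice = next((c for c in range(n) if is_safe(cols, c)), None)
        if choice is None:
            return None  # dead end: an earlier commitment cannot be retracted
        cols.append(choice)
    return cols


def backtracking(n: int, cols: Optional[List[int]] = None) -> Optional[List[int]]:
    """Depth-first search that discards bad partial assignments."""
    cols = [] if cols is None else cols
    if len(cols) == n:
        return list(cols)
    for c in range(n):
        if is_safe(cols, c):
            cols.append(c)
            result = backtracking(n, cols)
            if result is not None:
                return result
            cols.pop()  # retract the guess: the step commit_only lacks
    return None


if __name__ == "__main__":
    print(commit_only(4))   # None
    print(backtracking(4))  # [1, 3, 0, 2]
```

The two procedures share the same validity check and differ only in the `cols.pop()` line, which is the sense in which the limit sits in the shape of generation rather than in the quality of the per-step choice.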
What makes the framework genuinely explanatory rather than just a label is that the autoregressive factorization turns out to be *contingent, not necessary*. Diffusion language models match autoregressive scaling, which suggests that scaling comes from architecture, data scale, and Fisher consistency rather than from left-to-right generation itself (Does autoregressive generation uniquely enable LLM scaling?). Once you see autoregression as one choice among alternatives, its embers become legible by contrast: diffusion models can do gradient-based global control over a whole sequence that autoregressive models can't reach (Can diffusion models enable control that autoregressive models cannot reach?), and they unlock parallel, non-sequential generation (Can diffusion language models match autoregressive inference speed?). Each capability a non-autoregressive model gains by construction is, read backward, an ember the autoregressive one is stuck with.
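For reference, the factorization in question is the chain rule applied in left-to-right order. The notation is an assumption introduced here, not taken from the notes: $x_{1:T}$ is a token sequence and $\theta$ the model parameters.

```latex
p_\theta(x_{1:T}) \;=\; \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_{<t}\right)
\qquad\Longleftrightarrow\qquad
\log p_\theta(x_{1:T}) \;=\; \sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)
```

The identity holds for any distribution over sequences; the contingent part is choosing to parameterize and sample the factors strictly in that order.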
There's a deeper twist the corpus surfaces: the autoregressive objective doesn't just constrain, it imprints behavior. Post-training shifts a model from passive next-token prediction toward treating its own outputs as actions that become its future inputs, closing an action-perception loop, with measurable signatures like 3–4x lower output entropy on its own trajectories (Do models recognize their own outputs as actions shaping future inputs?). That's the framework being predictive in the other direction: it tells you what new behaviors emerge once the model is conditioned on sequences it generated itself. And the same factorization that makes autoregression predictable is what makes its alternatives hard to post-train with existing methods: diffusion breaks the log-likelihood factorization that reinforcement learning methods like GRPO and DPO rely on, so the very property that gives autoregression its tractable structure is the one diffusion has to work around (Why can't we easily adapt reinforcement learning to diffusion language models?).
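A minimal sketch of why methods like DPO lean on that factorization: the loss needs an exact sequence log-likelihood, which an autoregressive model supplies as a plain sum of per-token log-probabilities. The function names and the toy log-probabilities below are assumptions for illustration; the loss is the standard DPO form, $-\log \sigma(\beta \cdot \text{margin})$.

```python
import math
from typing import Sequence


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Under left-to-right factorization, log p(y) is the sum of per-token log-probs."""
    return sum(token_logprobs)


def dpo_loss(
    policy_chosen: Sequence[float],
    policy_rejected: Sequence[float],
    ref_chosen: Sequence[float],
    ref_rejected: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one preference pair, from per-token log-probs."""
    margin = (sequence_logprob(policy_chosen) - sequence_logprob(ref_chosen)) - (
        sequence_logprob(policy_rejected) - sequence_logprob(ref_rejected)
    )
    # -log(sigmoid(x)) == log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))


if __name__ == "__main__":
    # Toy per-token log-probs (made up for illustration).
    ref_w, ref_l = [-1.0, -2.0, -0.5], [-1.5, -1.0]
    pol_w, pol_l = [-0.8, -1.5, -0.4], [-2.0, -1.6]
    print(dpo_loss(ref_w, ref_l, ref_w, ref_l))  # policy == reference
    print(dpo_loss(pol_w, pol_l, ref_w, ref_l))  # policy prefers the chosen answer
```

Every input here is a list of per-token terms that the model emits as a by-product of generation. When tokens are produced by iterative, non-sequential denoising, no such list exists, and the sequence likelihood requires marginalizing over denoising trajectories instead.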
Worth being honest: the corpus here addresses the *territory* of the embers framework — architectural residue of the generation objective — under different vocabulary, rather than containing the framing paper itself. The thread that ties it together is the move the question is really asking about: stop treating model behavior as a black box to be measured, and start deriving it from the objective. When that derivation holds — retraction, global control, entropy collapse — the embers stop being a metaphor and become a forecast.
Sources (6 notes)
The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.
LLaDA demonstrates that non-autoregressive diffusion models match autoregressive scaling performance. This suggests scalability emerges from the interplay of architecture, dataset size, and Fisher-consistent principles—meaning autoregressive factorization is contingent rather than necessary.
Diffusion-LM succeeds on six fine-grained control tasks (syntax, semantics, infilling, length) where plug-and-play methods fail. Its continuous latent variables allow gradients to flow across the entire sequence simultaneously, replacing the discrete-token bottleneck and enabling parallel denoising.
Discrete Diffusion Forcing breaks the speed barrier through block-wise autoregressive generation with KV cache reuse and inter-block parallel decoding. This hybrid approach recovers both the compute efficiency of AR and the parallelism advantage of diffusion.
Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.
Diffusion language models cannot directly use AR-developed RL methods like GRPO and DPO because iterative non-sequential token generation requires marginalizing over denoising trajectories, making likelihood intractable. Workarounds exist—outcome-based rewards, policy learning for unmasking order, and adapted preference optimization—enabling models like DCoLT to gain 9–19% on benchmarks.