Do bidirectional and any-order generation expose different parts of the joint distribution?
This explores whether the *direction* a model generates text — strictly left-to-right vs. filling in any position in any order — actually changes which parts of the underlying probability landscape (the 'joint distribution' over all the tokens together) the model can reach.
The question is whether generation direction is just a mechanical detail or whether it genuinely changes what a model can express. The short version the corpus suggests: yes, it matters. Autoregressive (left-to-right) generation commits to a chain of decisions it can never take back, while bidirectional and any-order generation can revisit and refine, which opens up regions of the joint distribution that the first approach cannot reach in practice.
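A short piece of notation may make the 'joint distribution' framing concrete. The symbols here ($x_t$ for tokens, $\sigma$ for an arbitrary ordering of positions) are introduced for illustration and do not come from the notes. By the chain rule, a left-to-right ordering and any other ordering are both exact factorizations of the same joint:

```latex
p(x_1, \dots, x_n)
  = \prod_{t=1}^{n} p\left(x_t \mid x_1, \dots, x_{t-1}\right)
  = \prod_{t=1}^{n} p\left(x_{\sigma(t)} \mid x_{\sigma(1)}, \dots, x_{\sigma(t-1)}\right)
```

Read this way, the difference between the regimes is plausibly not in what could be represented in principle, but in which conditionals a model actually learns and which ones its decoding procedure lets it use once earlier tokens are committed.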
The clearest evidence is the contrast around constraint satisfaction. Token-by-token autoregressive generation lacks a 'retraction' primitive: once a token is emitted, it's fixed, so problems that require discarding an invalid partial guess and backing up hit an architectural ceiling, not just a model-quality one (see "Why does autoregressive generation fail at constraint satisfaction?"). That limitation is really a statement about the joint distribution: left-to-right decoding conditions every later token on a frozen prefix, so any joint configuration that's only discoverable by editing earlier choices is effectively unreachable. Diffusion LLMs attack exactly this seam. Their bidirectional attention lets reasoning and answer tokens be refined *simultaneously* across masked positions rather than in prefix order, so confidence on the answer can converge early while the reasoning around it keeps adjusting (see "Can reasoning and answers be generated separately in language models?"). That's not just faster sampling; it's accessing joint structure through a different door.
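The missing 'retraction' primitive can be illustrated with a toy constraint problem. The sketch below uses 4-queens purely as a stand-in (it is not taken from the notes), and `commit_only` and `with_retraction` are hypothetical names: the first always keeps its first locally valid choice, the second is ordinary backtracking search.

```python
from typing import List, Optional


def consistent(prefix: List[int], col: int) -> bool:
    """True if a queen in the next row at `col` attacks no queen in `prefix`."""
    row = len(prefix)
    return all(c != col and abs(c - col) != row - r for r, c in enumerate(prefix))


def commit_only(n: int) -> Optional[List[int]]:
    """Left-to-right decoding with no way to undo an emitted choice."""
    prefix: List[int] = []
    for _ in range(n):
        choices = [c for c in range(n) if consistent(prefix, c)]
        if not choices:
            return None  # dead end, and nothing earlier can be revised
        prefix.append(choices[0])  # stand-in for the model's top-ranked token
    return prefix


def with_retraction(n: int, prefix: Optional[List[int]] = None) -> Optional[List[int]]:
    """Same left-to-right order, but invalid partial guesses are discarded."""
    prefix = [] if prefix is None else prefix
    if len(prefix) == n:
        return list(prefix)
    for c in range(n):
        if consistent(prefix, c):
            prefix.append(c)
            result = with_retraction(n, prefix)
            if result is not None:
                return result
            prefix.pop()  # the retraction step that plain decoding lacks
    return None


print(commit_only(4))      # None
print(with_retraction(4))  # [1, 3, 0, 2]
```

A real model samples rather than always taking the first valid option, so it would sometimes land on a solution by luck. The sketch only shows why success then depends on the initial guesses instead of on search.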
But here's the twist worth sitting with: a different direction doesn't by itself guarantee a different *shape* of distribution. Even autoregressive generation isn't really 'one path'. Temperature-zero or fixed-seed settings just replay a single output from the same distribution, which feels reliable but is statistically just one sample among many (see "Does setting temperature to zero actually make LLM outputs reliable?"). And the way ordinary generation flows is described as a *smooth* probabilistic continuation toward the training distribution: it doesn't explore competing or contradictory branches as it goes; it follows the path of least surprise (see "Does LLM generation explore competing claims while producing text?"). So the interesting question isn't only 'can we reach more configurations' but 'do these regimes sample the same landscape differently'. Any-order generation potentially exposes the high-constraint, mutually-dependent corners that smooth left-to-right flow tends to glide past.
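The point about temperature zero and fixed seeds can be seen in a toy model. The two-outcome `MODEL` table and the `decode` helper below are invented for illustration, and temperature zero is modeled as picking the most probable token at each step.

```python
import random
from typing import Dict, Optional

# Hypothetical next-token table: the model is far from certain at the first step.
MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"yes": 0.55, "no": 0.45},
    "yes": {"</s>": 1.0},
    "no": {"</s>": 1.0},
}


def decode(rng: Optional[random.Random] = None) -> str:
    """Greedy decoding when rng is None, otherwise sample from each distribution."""
    token, out = "<s>", []
    while True:
        dist = MODEL[token]
        if rng is None:
            token = max(dist, key=dist.get)  # temperature zero: take the mode
        else:
            token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if token == "</s>":
            return " ".join(out)
        out.append(token)


greedy_runs = {decode() for _ in range(100)}
replayed = {decode(random.Random(0)) for _ in range(100)}
varied = {decode(random.Random(seed)) for seed in range(200)}

print(greedy_runs)   # one distinct output, repeated 100 times
print(len(replayed)) # 1: a fixed seed replays the same draw
print(varied)        # both outcomes appear once the seed changes
```

Perfect repeatability in the first two cases says nothing about the 45% of probability mass the toy model places on the other answer, which is the sense in which consistency and reliability come apart.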
There's also a deeper framing lurking here: where the real computation lives. If reasoning is mostly a latent-state trajectory and the surface text is only a partial interface to it (see "Where does LLM reasoning actually happen during generation?"), then 'direction of generation' is partly about how much of that hidden trajectory each scheme lets you re-enter and revise. Relatedly, left-to-right ordering is sequential but *atemporal*: there's no pause-and-reconsider between tokens (see "Does AI text generation unfold through temporal reflection?"). Any-order refinement is, in a sense, the architecture's substitute for that missing reconsideration: instead of revising over time, it revises over position.
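The contrast between working in index order and working over positions can be sketched with a tiny fill-in task. Everything here is hypothetical (the four-slot sequence, the `CLUES` and `EQUAL` constraints, and the function names), and it models only the simplest form of any-order generation, choosing which slot to fill next by confidence; it does not model the iterative re-masking that diffusion LLMs use.

```python
from typing import Callable, List, Optional, Tuple

N = 4
DOMAIN = (1, 2, 3)
CLUES = {3: 3}    # hypothetical: the last slot is pinned to 3
EQUAL = [(0, 3)]  # hypothetical: slot 0 must match slot 3

Seq = List[Optional[int]]


def candidates(seq: Seq, pos: int) -> List[int]:
    """Values for `pos` that agree with the clues and with every filled slot."""
    out = []
    for v in DOMAIN:
        if CLUES.get(pos, v) != v:
            continue
        partners = [b if pos == a else a for a, b in EQUAL if pos in (a, b)]
        if all(seq[p] is None or seq[p] == v for p in partners):
            out.append(v)
    return out


def fill(pick: Callable[[Seq, List[int]], int]) -> Tuple[Optional[Seq], List[int]]:
    """Fill every slot once, in the order chosen by `pick`, never undoing a choice."""
    seq: Seq = [None] * N
    order: List[int] = []
    while None in seq:
        masked = [i for i, v in enumerate(seq) if v is None]
        pos = pick(seq, masked)
        options = candidates(seq, pos)
        if not options:
            return None, order  # stuck: a committed slot rules out every value
        seq[pos] = options[0]   # stand-in for the model's top choice
        order.append(pos)
    return seq, order


def by_index(seq: Seq, masked: List[int]) -> int:
    """Left-to-right: always take the lowest unfilled position."""
    return masked[0]


def by_confidence(seq: Seq, masked: List[int]) -> int:
    """Any-order: take the position with the fewest remaining candidates."""
    return min(masked, key=lambda i: len(candidates(seq, i)))


print(fill(by_index))       # (None, [0, 1, 2])
print(fill(by_confidence))  # ([3, 1, 1, 3], [3, 0, 1, 2])
```

In this toy, the confident slot is filled first and the slots that depend on it follow, so no retraction is ever needed; the index-ordered pass commits to slot 0 before it has seen the constraint that decides it.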
What you didn't know you wanted to know: this same 'generation as something you can loop back into' idea shows up outside the decoder, too. Systems that feed a model's own partial answer back as a new retrieval query surface information gaps the original question couldn't express (see "Can a model's partial response guide what to retrieve next?"), and bidirectional RAG can even fold verified generations back into its knowledge base (see "Can RAG systems safely learn from their own generated answers?"). The thread connecting all of these to your question: the more a system can revisit and revise its own output rather than committing irreversibly forward, the more of the joint structure it can actually touch.
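The retrieval loop described above can be sketched in a few lines. This is loosely in the spirit of the iterative retrieve-then-generate idea and is not the published method: the three-document corpus, the keyword-overlap `retrieve`, and the `draft` stub (which stands in for an LLM call) are all assumptions made for illustration.

```python
import re
from typing import List, Set

STOP = {"what", "is", "the", "of", "was", "by", "in", "a", "up"}

# Hypothetical toy corpus.
DOCS = [
    "Dune was written by Frank Herbert",
    "Frank Herbert grew up in Tacoma",
    "Paris is a city in France",
]

QUESTION = "What is the hometown of the author of Dune"


def terms(text: str) -> Set[str]:
    """Lowercased content words, with a small stop list removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP}


def retrieve(query: str, seen: List[str]) -> List[str]:
    """Unseen documents that share at least one content word with the query."""
    q = terms(query)
    return [d for d in DOCS if d not in seen and terms(d) & q]


def draft(context: List[str]) -> str:
    """Hypothetical generator stub: a real system would call an LLM here."""
    return ". ".join(context)


def iterate(question: str, rounds: int) -> List[str]:
    """Alternate retrieval and drafting, feeding each draft into the next query."""
    context: List[str] = []
    query = question
    for _ in range(rounds):
        context += retrieve(query, context)
        query = question + " " + draft(context)
    return context


print(iterate(QUESTION, 1))  # only the document about who wrote Dune
print(iterate(QUESTION, 2))  # the draft names the author, which unlocks the second document
```

The original question shares no content words with the second document, so it is reachable only after the first draft has put the author's name into the query.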
Sources (8 notes)
The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.
ICE shows that bidirectional attention in diffusion LLMs enables in-place prompting—embedding reasoning directly in masked positions refined alongside answers. Answer confidence converges early while reasoning continues refining, allowing early-exit mechanisms to cut compute by 50% while maintaining accuracy.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Token prediction trains models to continue toward the training distribution, not to explore logically related counterpositions. This smoothness in process produces smooth claims that multiply without generating new perspectives.
Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.
Token ordering in LLMs follows probabilistic selection without intervening reflection or revision. Human discourse gains meaning from temporal structure—time spent thinking changes what comes next—but AI text production lacks this duration-in-reflection despite appearing sequentially composed.
ITER-RETGEN shows that iteratively using generated responses as retrieval queries substantially improves performance on multi-hop reasoning and fact verification. Generation acts as both answer producer and information-need clarifier, surfacing implicit gaps that the original query missed.
Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.