Why does bidirectional attention in diffusion models prevent KV cache reuse?

This note explores a structural trade-off in how diffusion language models attend, and why the KV cache, the speed trick that makes autoregressive decoding cheap, doesn't transfer to them.

This question is really about a mismatch between two ways of generating text. In a standard autoregressive model, attention is *causal*: each token can only look backward at tokens already written. Because the past never changes once it's generated, the keys and values computed for those earlier tokens are frozen: you compute them once and reuse them for every future step. That reuse is the KV cache, and it's one of the main reasons autoregressive decoding is fast.

Diffusion language models break this. Their attention is *bidirectional*: every position attends to every other position, including positions 'ahead' of it, and the whole sequence is repeatedly revised across denoising steps. When all tokens are mutable and all positions see each other, there are no frozen keys and values to cache. Even a token that stays the same has hidden states that depend on every other position, so beyond the first layer its keys and values shift whenever any other token is revised. Each refinement step therefore recomputes attention over a sequence that just changed underneath it. Bidirectionality and the cache are incompatible: the cache exists only because causal attention guarantees the past is fixed.
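To make the contrast concrete, here is a minimal NumPy sketch, not any real model's code. It assumes a toy single-head attention layer with random weights and a residual connection; the names `attention_layer`, `layer2_keys`, and `prefix_keys_unchanged` are made up for illustration. It changes only the last token of a sequence and checks whether the second-layer keys of the earlier tokens stay the same, which is the property a KV cache relies on.

```python
import numpy as np

def attention_layer(x, Wq, Wk, Wv, causal):
    """Toy single-head attention with a residual connection (illustrative only)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    if causal:
        # Mask out every position to the right of the query position.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return x + weights @ v

def layer2_keys(x, params, causal):
    """Keys a second layer would compute from the first layer's output."""
    Wq, Wk, Wv, Wk2 = params
    hidden = attention_layer(x, Wq, Wk, Wv, causal)
    return hidden @ Wk2

rng = np.random.default_rng(0)
d = 8
# Hypothetical random weights, scaled so the softmax is not sharply peaked.
params = [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4)]

x = rng.normal(size=(6, d))
x_changed = x.copy()
x_changed[-1] = rng.normal(size=d)  # revise only the last token

def prefix_keys_unchanged(causal):
    """True if the keys of all earlier tokens survive the change to the last token."""
    before = layer2_keys(x, params, causal)[:-1]
    after = layer2_keys(x_changed, params, causal)[:-1]
    return bool(np.allclose(before, after))

print("causal:", prefix_keys_unchanged(causal=True))
print("bidirectional:", prefix_keys_unchanged(causal=False))
```

Under the causal mask the earlier tokens' keys survive the change, so they could have been cached. Under full attention the same change ripples into every position's keys, which is why there is nothing stable to reuse.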

The corpus doesn't have a note aimed squarely at this engineering point, but it holds the pieces that make the trade-off legible. The clearest doorway is "Can diffusion models commit to answers before full decoding?", which shows diffusion models converging to the right answer roughly halfway through refinement: up to 99% of MMLU instances are correct by the midpoint. That matters here because it reframes the efficiency story: diffusion gives up KV cache reuse, but it can claw speed back from the *other* direction, by stopping refinement early once confidence stabilizes. So the lost cache isn't the end of the efficiency conversation; it shifts where the savings come from.
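A rough sketch of what stopping early on stable confidence might look like follows. This is an illustration under assumptions, not Prophet's actual algorithm: the gap here is simply the difference between the top two token probabilities at each position, and `step_fn`, `gap_threshold`, and the toy `sharpen` step are hypothetical stand-ins for a real denoising pass and a tuned stopping rule.

```python
import numpy as np

def confidence_gap(logits):
    """Per-position gap between the top-1 and top-2 token probabilities."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    top2 = np.sort(probs, axis=-1)[:, -2:]  # columns: second largest, largest
    return top2[:, 1] - top2[:, 0]

def refine_with_early_exit(step_fn, logits, max_steps, gap_threshold):
    """Run refinement passes, stopping once every position clears the gap threshold."""
    steps_used = 0
    for _ in range(max_steps):
        logits = step_fn(logits)
        steps_used += 1
        if confidence_gap(logits).min() >= gap_threshold:
            break  # every position is confident; skip the remaining passes
    return logits.argmax(axis=-1), steps_used

def sharpen(logits):
    # Toy stand-in for a denoising pass: each call makes the logits more peaked.
    return logits * 1.5

start = np.array([[1.0, 0.0, 0.0],
                  [0.0, 0.5, 0.0]])
tokens, steps_used = refine_with_early_exit(
    sharpen, start, max_steps=20, gap_threshold=0.9
)
print(tokens, steps_used)
```

The design choice worth noticing is that the savings come from skipping whole refinement passes, not from reusing work inside a pass, so it is compatible with bidirectional attention in a way the KV cache is not.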

It's worth seeing how hard autoregressive systems lean on the cache to appreciate what diffusion forfeits. "Can recursive subtask trees overcome context window limits?" treats the KV cache as the actual working memory of reasoning, pruning it with rules to sustain long chains even when manipulating 90% of the cache. That whole strategy presupposes a stable, append-only cache you can selectively keep or evict. Bidirectional attention removes the premise: you can't prune a cache that's being fully recomputed every step.

Two more notes reframe the underlying tension. "Is long-context bottleneck really about memory or compute?" argues the real long-context constraint was never memory capacity but the *compute* to consolidate context into state, which loosely parallels the cost diffusion pays in full at every refinement pass. And "Does transformer attention architecture inherently favor repeated content?" is a reminder that attention's structure isn't neutral plumbing; the directionality of attention has downstream consequences for what a model can do efficiently and even how it behaves.

The thing you might not have expected to learn: the KV cache isn't a generic optimization that diffusion models forgot to implement. It's a privilege that causal, left-to-right generation earns by promising never to revise the past. Bidirectional diffusion buys global, revisable context, and the price of that revisability is precisely the cache. The interesting research frontier, as "Can diffusion models commit to answers before full decoding?" hints, is recovering speed through early convergence rather than mourning the cache.

Sources 4 notes

Can diffusion models commit to answers before full decoding?

Up to 99% of MMLU instances and 97% of GSM8K instances reach correct answers by the midpoint of refinement. Prophet exploits this by monitoring confidence gaps to stop early, achieving 3.4× speedup with no quality loss.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Does transformer attention architecture inherently favor repeated content?

Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.
