INQUIRING LINE

Can layer-wise KV caches enable truly lossless information transfer?

This explores whether sharing transformer KV caches directly between models (instead of passing text back and forth) really preserves information without loss — and what 'lossless' even means here.


This explores whether sharing transformer KV caches directly between models, rather than serializing thoughts into text, can move information without losing anything in translation. The corpus's clearest answer comes from latent multi-agent collaboration, where agents pass their internal representations to each other through KV caches rather than re-encoding everything as language (see: Can agents share thoughts without converting them to text?). The payoff is striking: accuracy gains reaching 14.6% and a 70.8-83.7% reduction in tokens, with no additional training. The intuition is that text is a lossy bottleneck: when a model writes out its reasoning, it collapses a rich, high-dimensional hidden state into a flat sequence of words, and the next model has to reconstruct that state from scratch. Hand over the cache directly and you skip the round-trip.
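The bottleneck intuition can be made concrete with a toy sketch. This is not LatentMAS itself: the two-dimensional "hidden states" and the four-word vocabulary `VOCAB` are invented purely for illustration. The text channel snaps each state to its nearest word embedding and the receiver rebuilds from that word, while the latent channel hands the states over unchanged.

```python
import math

# Hypothetical 4-word vocabulary with 2-D embeddings (illustrative values only).
VOCAB = {
    "up": (0.0, 1.0),
    "down": (0.0, -1.0),
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
}


def to_text(states):
    """Text channel, sender side: collapse each state to its nearest word."""
    return [min(VOCAB, key=lambda w: math.dist(VOCAB[w], s)) for s in states]


def from_text(words):
    """Text channel, receiver side: rebuild states from word embeddings."""
    return [VOCAB[w] for w in words]


def latent_handoff(states):
    """Latent channel: pass the states through untouched (a plain copy)."""
    return list(states)


def total_error(sent, received):
    """Sum of Euclidean distances between sent and received states."""
    return sum(math.dist(a, b) for a, b in zip(sent, received))


# Assumed sender states; any vectors that are not exactly word embeddings work.
sender_states = [(0.2, 0.9), (0.7, 0.6), (-0.4, -0.8)]

text_error = total_error(sender_states, from_text(to_text(sender_states)))
latent_error = total_error(sender_states, latent_handoff(sender_states))

print(f"text channel error:   {text_error:.3f}")
print(f"latent channel error: {latent_error:.3f}")
```

In this toy the latent channel has zero reconstruction error by construction, while the text channel's error is whatever distance separates each state from its nearest word. Real hidden states and vocabularies are far larger, so the sketch shows only the direction of the effect, not its size.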

But 'lossless' deserves scrutiny, because the corpus shows the KV cache is something people aggressively *throw away*, not preserve. One line of work structures reasoning as recursive subtask trees and then prunes the cache by rule, and reasoning stays accurate even when 90% of the cache is manipulated (see: Can recursive subtask trees overcome context window limits?). That's the opposite premise: most of what's in the cache is disposable. If pruning that aggressive leaves accuracy intact, then the cache was never carrying that much irreducible information, which complicates the claim that transferring it whole is what makes transfer lossless. The honest reading is that latent transfer is lossless *relative to text serialization*, not lossless in some absolute sense.
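A minimal sketch of what rule-based pruning over a subtask-structured cache might look like follows. The rule here (drop a finished subtask's working entries, keep only its result) and the `CacheEntry` type are assumptions made for illustration, not the Thread Inference Model's actual mechanism, and the 9-to-1 ratio of working entries to results is chosen only to mirror the 90% figure.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    task: str        # subtask that produced this entry
    is_result: bool  # True for the subtask's final conclusion


def prune_finished(cache, finished_tasks):
    """Keep everything from open subtasks, but only the result of finished ones."""
    return [e for e in cache if e.task not in finished_tasks or e.is_result]


# Build a toy cache: three subtasks, each with 9 working entries and 1 result.
cache = []
for task in ("parse", "search", "verify"):
    cache += [CacheEntry(task, False) for _ in range(9)]
    cache.append(CacheEntry(task, True))

pruned = prune_finished(cache, finished_tasks={"parse", "search", "verify"})
dropped_fraction = 1 - len(pruned) / len(cache)

print(f"kept {len(pruned)} of {len(cache)} entries "
      f"({dropped_fraction:.0%} dropped)")
```

The design point is that the pruning decision depends only on the tree structure (which subtasks are finished), not on inspecting the content of each entry.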

The deeper tension the corpus surfaces is whether the cache is even the right thing to move. A separate result argues the long-context bottleneck isn't memory capacity at all but the *compute* needed to consolidate evicted context into a model's fast weights: performance climbs with more consolidation passes, like a sleep cycle (see: Is long-context bottleneck really about memory or compute?). From that angle, raw KV state is unconsolidated; the real information lives in what a model *does* with it. Memory-folding agents make the same move from the other direction, compressing interaction history into structured schemas and finding that the structure, not the verbatim retention, is what prevents degradation (see: Can agents compress their own memory without losing critical details?). And Markov-style reasoning systems deliberately discard accumulated history entirely, keeping only the current problem state, and still preserve answer-equivalence (see: Can reasoning systems forget history without losing coherence?). If you can forget the past and lose nothing that matters, 'lossless' is doing less work than it sounds.
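The Markov-style claim, that a solver can keep only the current problem state and still reach the same answer, can be illustrated with a toy arithmetic stand-in. This is not the Atom of Thoughts algorithm: the nested-tuple expression format and the `contract` step are assumptions for illustration. Each step solves every subproblem that is ready and overwrites the state with the smaller problem; no history is retained.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "-": operator.sub}


def contract(expr):
    """One contraction step: solve every ready subproblem, keep the rest."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    if not isinstance(left, tuple) and not isinstance(right, tuple):
        return OPS[op](left, right)
    return (op, contract(left), contract(right))


def solve_memoryless(expr):
    """Iterate contraction while holding only the current state."""
    state = expr
    steps = 0
    while isinstance(state, tuple):
        state = contract(state)  # the previous state is overwritten
        steps += 1
    return state, steps


def solve_direct(expr):
    """Reference: ordinary recursive evaluation of the full expression."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    return OPS[op](solve_direct(left), solve_direct(right))


# Toy problem: (2 * 3) + ((4 + 5) - 1)
problem = ("+", ("*", 2, 3), ("-", ("+", 4, 5), 1))
answer, steps = solve_memoryless(problem)

print(answer, steps, solve_direct(problem))
```

Answer-equivalence here means `solve_memoryless` and `solve_direct` agree even though the former never looks back at earlier states.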

There's also a layer-wise wrinkle worth knowing. KV caches are per-layer objects, and the corpus has a related finding that adjacent transformer blocks are similar enough to *share*: recomputing one block twice beats moving separate weights on memory-bound hardware (see: Does recomputing weights cost less than moving them on mobile?). That redundancy between layers hints that a layer-by-layer cache isn't a set of independent, irreplaceable signals; some of it is recoverable, which is exactly why pruning and sharing work. So the most useful way to hold the question: direct KV transfer genuinely beats text as a channel, but its losslessness is a comparison, not an absolute. The field is simultaneously discovering how much of the cache it can safely throw away.
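The block-sharing trade-off mentioned above can be sketched by counting weight fetches. Everything here is a hypothetical stand-in: `WeightStore` plays the role of slow memory, and `apply_block` is a scalar toy rather than a transformer block. The sketch shows only the accounting, one fetch for two block applications, and says nothing about real latency or accuracy.

```python
class WeightStore:
    """Toy stand-in for slow memory; counts how often weights are fetched."""

    def __init__(self, weights):
        self._weights = weights
        self.fetches = 0

    def fetch(self, name):
        self.fetches += 1
        return self._weights[name]


def apply_block(weight, xs):
    """Toy 'block': a residual update x + weight * x on each element."""
    return [x + weight * x for x in xs]


def run_separate(store, xs):
    """Two adjacent blocks, each with its own weights: two fetches."""
    xs = apply_block(store.fetch("block_1"), xs)
    return apply_block(store.fetch("block_2"), xs)


def run_shared(store, xs):
    """One set of weights fetched once and applied twice in a row."""
    w = store.fetch("block_1")
    return apply_block(w, apply_block(w, xs))


separate_store = WeightStore({"block_1": 2, "block_2": 3})
shared_store = WeightStore({"block_1": 2})

separate_out = run_separate(separate_store, [1, 2])
shared_out = run_shared(shared_store, [1, 2])

print("separate:", separate_out, "fetches:", separate_store.fetches)
print("shared:  ", shared_out, "fetches:", shared_store.fetches)
```

The two variants compute different functions, so sharing is a modelling choice as well as a memory optimization; the toy only makes visible that both run two block computations while the shared variant moves weights once.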


Sources (6 notes)

Can agents share thoughts without converting them to text?

LatentMAS enables agents to share internal representations directly via KV caches, reaching 14.6% accuracy gains and 70.8-83.7% token reduction with no additional training. Hidden embeddings preserve reasoning fidelity that text-based systems cannot.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Can reasoning systems forget history without losing coherence?

Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by recomputing one block twice incurs less latency than fetching separate weights, gaining accuracy with no parameter increase.
