INQUIRING LINE

Why do LLMs degrade on long inputs before hitting context limits?

This explores why model performance drops well inside the advertised context window — so the failure is about how attention and reasoning handle volume, not about literally running out of room.


The collection converges on a counterintuitive answer: the context window is a storage spec, not a usage spec, and several different mechanisms erode quality long before the buffer fills.

The sharpest reframe is that the long-context bottleneck is *compute, not memory* [Is long-context bottleneck really about memory or compute?]. The limiting factor isn't whether the tokens fit; it's the work required to consolidate everything into a usable internal state. In the cited work, context evicted from the window is consolidated into fast weights during offline "sleep" phases, and performance improves with the number of consolidation passes rather than with raw window size. When that consolidation budget runs short, the model holds the text but can't actually *use* it. That's why simply having room left over doesn't guarantee comprehension.

A second mechanism is premature commitment. When information arrives gradually, as it does in long multi-turn exchanges, models lock into early guesses and can't course-correct: accuracy drops from about 90% on single-message instructions to about 65% across natural conversation [Why do AI assistants get worse at longer conversations?], and a study of 200,000+ conversations reports a 39% average performance drop in multi-turn settings [Why do language models fail in gradually revealed conversations?]. There's even a mechanistic fingerprint: uncertainty signals dominate the early transformer blocks, while the signals that reward keeping options open emerge only in the middle blocks, so the model commits before those signals can influence the decision [Why do large language models explore less effectively than humans?]. Long input isn't neutral; it's more surface area to commit to the wrong thing early.

Third, errors compound silently. Across long delegated workflows, frontier models corrupt roughly 25% of document content, and the errors keep accumulating through 50 round-trips without ever plateauing [Do frontier LLMs silently corrupt documents in long workflows?]. The same brittleness shows up structurally: grammatical and parsing competence degrades predictably as syntactic depth and embedding increase, suggesting the model leans on surface heuristics that hold for simple spans and break as complexity stacks up [Does LLM grammatical performance decline with structural complexity?]. More input means more depth, more nesting, and more chances for the heuristic to fail.
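To see why this kind of corruption is easy to miss on any single pass, here is a minimal arithmetic sketch. It rests on two assumptions the source does not state: that the ~25% figure is reached at around 50 round-trips, and that every round-trip corrupts the same share of the remaining content. Both function names are hypothetical, and the model is a toy, not a description of how the cited study measured degradation.

```python
def surviving_fraction(per_trip_loss: float, trips: int) -> float:
    """Fraction of content still intact after `trips` round-trips.

    Toy assumption: each trip corrupts the same share of what remains.
    """
    return (1.0 - per_trip_loss) ** trips


def implied_per_trip_loss(total_loss: float, trips: int) -> float:
    """Per-trip loss that would yield `total_loss` after `trips` trips
    under the same toy model."""
    return 1.0 - (1.0 - total_loss) ** (1.0 / trips)


# Assumed inputs: ~25% total corruption, reached at 50 round-trips.
loss = implied_per_trip_loss(total_loss=0.25, trips=50)
print(f"implied per-trip loss: {loss:.2%}")

for n in (1, 10, 25, 50):
    print(f"after {n:>2} trips: {surviving_fraction(loss, n):.1%} intact")
```

Under these assumptions the implied per-trip loss is well under 1%, small enough to pass a spot check on any one hand-off while still adding up to a quarter of the document over the full workflow.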

The most revealing evidence, though, is the fixes. If degradation were purely about hitting limits, you couldn't beat it without a bigger window, yet you can. Recursive Language Models park the long prompt in an external environment (a Python REPL) and query it with code, beating base models *even on shorter prompts* because they sidestep attention degradation [Can models treat long prompts as external code environments?]. ReadAgent compresses documents into "gist memories" and fetches detail only when needed, extending effective context 3–20× [Can LLMs read long documents like humans do?]. And RAG design has shifted the burden from precise retrieval onto long-context readers, pairing coarse ranking over larger units with deep reading [Can long-context models resolve retriever-reader imbalance?]. That *restructuring how the model touches the text* recovers performance is the clearest evidence that the ceiling was never the token count; it was the quality of attention and consolidation spread across all those tokens.
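The pattern these fixes share can be sketched in a few lines: keep the long text outside the prompt and pull in only the pieces a question needs. The sketch below is a loose illustration under stated assumptions, not the implementation of Recursive Language Models or ReadAgent. `ExternalDocument`, `search`, and `lookup` are hypothetical names, the "gist" is just the first sentence of each page (a real system would have a model write it), and relevance is scored by plain word overlap.

```python
import re
from dataclasses import dataclass, field


def _words(text: str) -> set[str]:
    """Lowercased word set, used for crude overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class ExternalDocument:
    """Hypothetical helper: holds a long text outside the model's prompt."""
    pages: list[str]
    gists: list[str] = field(init=False)

    def __post_init__(self) -> None:
        # Assumption: the first sentence of a page stands in for its gist.
        self.gists = [
            re.split(r"(?<=[.!?])\s+", page.strip(), maxsplit=1)[0]
            for page in self.pages
        ]

    def search(self, pattern: str, window: int = 40) -> list[str]:
        """Code-style access: short snippets around each regex match."""
        hits = []
        for page in self.pages:
            for m in re.finditer(pattern, page):
                hits.append(page[max(0, m.start() - window): m.end() + window])
        return hits

    def lookup(self, question: str, top_k: int = 1) -> list[str]:
        """Gist-style access: rank pages by gist/question word overlap,
        then return the full text of only the best-matching pages."""
        q = _words(question)
        ranked = sorted(
            range(len(self.pages)),
            key=lambda i: len(q & _words(self.gists[i])),
            reverse=True,
        )
        return [self.pages[i] for i in ranked[:top_k]]


doc = ExternalDocument(pages=[
    "Billing rules for the service. Invoices are issued on the first of each month.",
    "Retention policy for logs. Logs are deleted after 30 days.",
    "Access control overview. Admin roles require two-factor login.",
])

snippets = doc.search(r"30 days")
detail = doc.lookup("What is the retention policy for logs?")
print(snippets)
print(detail)
```

In both access paths only a small slice of the stored text would ever reach a model's prompt, which is the design choice the fixes above have in common: the full document stays addressable, but attention is spent on a narrow, task-relevant span.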


Sources (9 notes)

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Why do AI assistants get worse at longer conversations?

LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15–20% of this loss.

Why do large language models explore less effectively than humans?

SAE decomposition shows uncertainty values dominate early transformer blocks while empowerment representations emerge only in middle blocks. This temporal mismatch causes models to commit to decisions before long-term exploration signals can influence them. Reasoning-trained o1 overcomes this by extending computation time.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Can models treat long prompts as external code environments?

Recursive Language Models store long prompts in a Python REPL and query them via code execution, avoiding attention degradation. RLMs outperform base models even on shorter prompts while handling inputs two orders of magnitude beyond context windows.

Can LLMs read long documents like humans do?

ReadAgent compresses documents into gist memories before knowing the task, then retrieves details only when needed, extending effective context 3–20× and outperforming retrieval baselines on long-document QA.

Can long-context models resolve retriever-reader imbalance?

LongRAG shows that 4K-token retrieval units paired with long-context readers outperform 100-word retrieval units on standard benchmarks. As context windows have expanded, the optimal RAG design has shifted from precise retrieval to coarse ranking plus deep reading.

Next inquiring lines