INQUIRING LINE

What computational cost does trajectory-bursty inference impose on per-query context requirements?

This explores a tension the corpus circles repeatedly: some methods only work when you stuff whole sequences of past actions (trajectories) into the prompt, and that hunger for context competes with everything else the model needs room to do.


The question is what happens to your context budget when a model needs not just a few examples but whole runs of behavior to learn from on the fly. The starting point is the finding that in-context learning for sequential decision-making requires trajectory burstiness (full or partial trajectories from the same environment level, not isolated examples) before a model can generalize without weight updates (see: Why do trajectories matter more than individual examples for in-context learning?). That's the catch: trajectories are long, so the very thing that unlocks the capability is also the thing that eats the context window.
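A back-of-the-envelope sketch makes the squeeze concrete. The function name and every number here are hypothetical, chosen only to show the shape of the trade-off; none of them come from the source notes.

```python
def remaining_context(window_tokens, trajectories, steps_per_trajectory,
                      tokens_per_step, reserved_for_output=0):
    """Tokens left for the query and reasoning after packing trajectories."""
    trajectory_tokens = trajectories * steps_per_trajectory * tokens_per_step
    return window_tokens - trajectory_tokens - reserved_for_output


# Assumed values: a 32k window, four 100-step trajectories at 60 tokens per step,
# and 2k tokens held back for the model's answer.
left = remaining_context(
    window_tokens=32_000,
    trajectories=4,
    steps_per_trajectory=100,
    tokens_per_step=60,
    reserved_for_output=2_000,
)
print(left)  # 32000 - 24000 - 2000 = 6000
```

Under these assumed numbers, three quarters of the window goes to demonstrations before the query or any reasoning is written.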

And context, it turns out, isn't a free storage problem; it's a compute problem. The long-context bottleneck isn't about how much you can hold but about the compute required to consolidate evicted context into fast weights, and on harder reasoning tasks performance keeps improving with more consolidation passes (see: Is long-context bottleneck really about memory or compute?). So bursty trajectories don't just fill the buffer; they impose a consolidation tax that grows with reasoning difficulty. This is why per-turn discipline matters: unrestricted reasoning inside a single turn quietly consumes the room later retrieval rounds need, and budgeting reasoning per turn, not just overall, preserves the context that multi-step work depends on (see: Does limiting reasoning per turn improve multi-turn search quality?).
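The per-turn point can be illustrated with a toy accounting loop. This is a minimal sketch under assumed numbers, not the method from the cited note: `rounds_that_fit`, the window size, and the token counts are all hypothetical.

```python
def rounds_that_fit(window, reasoning_per_turn, evidence_per_turn, per_turn_cap=None):
    """Count search turns that fit before the context window is exhausted."""
    used = 0
    rounds = 0
    for wanted in reasoning_per_turn:
        # With a cap, reasoning in each turn is truncated to the per-turn budget.
        reasoning = wanted if per_turn_cap is None else min(wanted, per_turn_cap)
        cost = reasoning + evidence_per_turn
        if used + cost > window:
            break
        used += cost
        rounds += 1
    return rounds


# Assumed: an 8k window, five turns that each "want" 3k reasoning tokens,
# and 1k tokens of retrieved evidence per turn.
wanted = [3000] * 5
uncapped = rounds_that_fit(8000, wanted, evidence_per_turn=1000)
capped = rounds_that_fit(8000, wanted, evidence_per_turn=1000, per_turn_cap=1000)
print(uncapped, capped)  # 2 4
```

With these made-up figures, capping each turn doubles the number of retrieval rounds that fit in the same window. Whether the truncated reasoning is still good enough is a separate question the sketch does not model.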

The interesting move is that several lines of work try to get trajectory-level benefits without paying the full trajectory-storage cost. One approach makes reasoning deliberately memoryless: decompose the problem into a directed acyclic graph (DAG) and contract it iteratively so each state depends only on the current subproblem, dropping the accumulated history that bloats context while keeping the answer equivalent (see: Can reasoning systems forget history without losing coherence?). Another prunes the KV cache by rule inside recursive subtask trees, sustaining accurate reasoning beyond the context limit even when manipulating 90% of the cache, so working memory is no longer bounded by the size of the window (see: Can recursive subtask trees overcome context window limits?). A third sidesteps the serial cost of depth by sampling parallel latent trajectories in width, spreading the cost across independent paths rather than one ever-growing chain (see: Can reasoning systems scale wider instead of only deeper?).
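To make the pruning idea tangible, here is a toy sketch in which a plain Python list stands in for the KV cache. It is not the Thread Inference Model's actual mechanism; the `Subtask` class, the `solve` function, and the rule "keep only a finished subtask's result" are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    name: str
    steps: list = field(default_factory=list)     # intermediate reasoning entries
    children: list = field(default_factory=list)  # nested subtasks
    result: str = ""


def solve(task, memory):
    """Walk the tree depth-first; return the peak number of entries held at once."""
    start = len(memory)
    memory.extend(f"{task.name}:{step}" for step in task.steps)
    peak = len(memory)
    for child in task.children:
        peak = max(peak, solve(child, memory))
    # Pruning rule (assumed): drop everything this subtask wrote, keep its conclusion.
    del memory[start:]
    memory.append(f"{task.name}={task.result}")
    return max(peak, len(memory))


root = Subtask(
    "root",
    steps=["plan"],
    children=[
        Subtask("a", steps=["s1", "s2", "s3"], result="ra"),
        Subtask("b", steps=["s1", "s2", "s3"], result="rb"),
    ],
    result="done",
)
memory = []
peak = solve(root, memory)
print(peak, memory)  # 5 ['root=done']
```

In this example the tree contains seven intermediate steps, but at most five entries are ever held at once, and only one survives at the end. The gap widens as subtrees get deeper, which is the intuition behind trading a longer window for a pruning rule.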

Step back and the real cost isn't a fixed number; it's an allocation question. Inference effectiveness varies sharply by prompt difficulty, and reallocating the same total compute adaptively (less for easy prompts, more for hard ones) can substantially outperform larger models run under a uniform budget (see: Can we allocate inference compute based on prompt difficulty?). Search budget behaves the same way: search iterations follow the same monotonic-then-diminishing-returns curve as reasoning tokens, which opens a second axis you can trade against (see: Does search budget scale like reasoning tokens for answer quality?). So trajectory-bursty inference doesn't impose one price on per-query context. It forces a portfolio decision: how much of a finite, compute-bound window do you spend holding history, versus pruning it, parallelizing it, or buying it back as search?
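As a rough illustration of the allocation framing, the sketch below splits a fixed budget across prompts in proportion to a difficulty score. The source notes do not say how difficulty is estimated or that proportional splitting is the scheme used; `allocate` and the integer scores are hypothetical.

```python
def allocate(difficulties, total_budget):
    """Split total_budget across prompts in proportion to integer difficulty scores."""
    total = sum(difficulties)
    # Integer floor shares first, then hand leftover units to the largest remainders.
    shares = [total_budget * d // total for d in difficulties]
    remainders = [total_budget * d % total for d in difficulties]
    leftover = total_budget - sum(shares)
    order = sorted(range(len(difficulties)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


# Assumed difficulty scores for an easy, a medium, and a hard prompt.
adaptive = allocate([1, 3, 6], total_budget=20)
print(adaptive)  # [2, 6, 12]
```

A uniform split of the same 20 units would give each prompt six or seven. The adaptive split moves most of that spend to the hard prompt without raising the total.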

The thing you might not have expected to learn: the corpus largely agrees that holding raw trajectories is the expensive, naive option, and most of the frontier is about faking burstiness — keeping the generalization payoff while throwing the bulky context away.


Sources (8 notes)

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Does limiting reasoning per turn improve multi-turn search quality?

Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.

Can reasoning systems forget history without losing coherence?

Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.

Can recursive subtask trees overcome context window limits?

The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
