How do repetition and inefficiency register as measurable trajectory features?
This explores whether the wasteful patterns in a model's process — going in circles, taking the long way around — show up as concrete, countable signals in the trajectory itself, rather than something you can only judge by reading the output.
Repetition and inefficiency, whether going in circles or padding the path, do leave measurable fingerprints in a trajectory according to the corpus, but in surprisingly indirect ways. The cleanest case is reasoning length. You'd assume a longer chain-of-thought means the model worked harder on a harder problem, but controlled A* maze experiments show that trace length tracks difficulty only when a problem sits close to the training distribution (see "Does longer reasoning actually mean harder problems?"). Out of distribution, the correlation collapses entirely, so a long, looping trace is often a tell that the model is recalling a familiar schema and spinning, not computing adaptively. Length becomes a measurable proxy for the wrong thing, and that mismatch is itself the diagnostic.
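One way to check this on your own traces is to correlate trace length with difficulty separately for each split. The sketch below is a minimal illustration, not the cited experiment: the record fields (`split`, `difficulty`, `trace_tokens`) and the toy numbers are assumptions invented for the example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 points")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series has no linear relationship to report
    return cov / sqrt(vx * vy)

def length_difficulty_correlation(records):
    """Correlate difficulty with trace length within each split."""
    by_split = {}
    for r in records:
        by_split.setdefault(r["split"], []).append(r)
    return {
        split: pearson([r["difficulty"] for r in rows],
                       [r["trace_tokens"] for r in rows])
        for split, rows in by_split.items()
    }

# Hypothetical toy data, made up for illustration only.
records = [
    {"split": "in_dist", "difficulty": 1, "trace_tokens": 110},
    {"split": "in_dist", "difficulty": 2, "trace_tokens": 205},
    {"split": "in_dist", "difficulty": 3, "trace_tokens": 290},
    {"split": "in_dist", "difficulty": 4, "trace_tokens": 410},
    {"split": "out_dist", "difficulty": 1, "trace_tokens": 300},
    {"split": "out_dist", "difficulty": 2, "trace_tokens": 120},
    {"split": "out_dist", "difficulty": 3, "trace_tokens": 130},
    {"split": "out_dist", "difficulty": 4, "trace_tokens": 310},
]
result = length_difficulty_correlation(records)
```

A strong coefficient in one split next to a near-zero one in the other is the pattern the maze result describes; on real data you would want many more points per split before reading anything into the gap.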
Where length alone is blunt, step-level structure is sharp. Averaging confidence across a whole trace hides the moment things go wrong; looking at confidence step by step catches reasoning breakdowns and even lets you stop a trace early, before it wastes more tokens, while matching the accuracy of majority voting with far fewer generations (see confidence-aware-step-level-filtering-outperforms-global-confidence-averaging-for-trace-selection). Inefficiency here isn't a vibe; it's a local dip you can point to and cut. The same logic scales up to whole rollouts: cross-rollout variance flags degenerate comparisons, cases where the candidates are too similar to learn anything from, and filters them out, treating redundancy itself as a statistic worth acting on (see cross-rollout-variance-functions-simultaneously-as-reward-signal-and-query-filter).
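Both ideas can be sketched in a few lines. The code below is an illustrative assumption rather than the cited notes' method: the window size, the 0.5 threshold, the variance cutoff, and all function names are hypothetical choices made for the example.

```python
from statistics import pvariance

def min_window_confidence(step_conf, window=3):
    """Lowest mean confidence over any sliding window of consecutive steps."""
    if not step_conf:
        raise ValueError("empty trace")
    w = min(window, len(step_conf))
    return min(sum(step_conf[i:i + w]) / w
               for i in range(len(step_conf) - w + 1))

def early_stop_index(step_conf, window=3, threshold=0.5):
    """Index of the first step where the trailing window mean falls below
    threshold, or None if the trace never dips that low."""
    for end in range(window, len(step_conf) + 1):
        if sum(step_conf[end - window:end]) / window < threshold:
            return end - 1
    return None

def filter_degenerate_queries(rollout_rewards, min_variance=1e-6):
    """Keep only queries whose rollouts disagree enough to be informative."""
    return {q: rs for q, rs in rollout_rewards.items()
            if len(rs) > 1 and pvariance(rs) > min_variance}

# Hypothetical per-step confidences, invented for illustration.
steady = [0.7] * 6
dipped = [0.99, 0.99, 0.99, 0.2, 0.2, 0.2, 0.99, 0.99, 0.99, 0.99]
```

In this toy data `dipped` has the higher global average, yet its weakest window sits far below that of `steady`, which is exactly the breakdown a whole-trace mean would hide. `early_stop_index(dipped)` points at the step where generation could be cut.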
The more interesting move is that trajectory shape can carry these signals without anyone reading the content. A structure-only model, one that looks purely at how a conversation unfolds geometrically and not at what was said, predicts user satisfaction with 68% accuracy, nearly matching full-text analysis at 70% (see "Can conversation shape predict whether it will work?"). Repetition and stalling have a geometry. The same principle drives process supervision derived from structural features alone: tree topology, expert-aligned actions, and tool-call positions become dense reward signals, so the shape of an agent's path substitutes for hand-annotated judgments about whether it's being efficient (see process-supervision-can-be-derived-from-structural-features-of-agent-trajectories).
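To make "repetition and stalling have a geometry" concrete, here is a minimal sketch that scores a trajectory using only its sequence of action labels and never its content. The feature set is an assumption chosen for illustration; it is not the feature set of the structure-only model or the process-supervision method cited above.

```python
from collections import Counter

def structural_features(actions):
    """Content-free repetition features for a sequence of action labels."""
    n = len(actions)
    if n == 0:
        return {"length": 0, "unique_ratio": 0.0,
                "repeated_bigram_fraction": 0.0, "longest_run": 0}
    bigrams = list(zip(actions, actions[1:]))
    counts = Counter(bigrams)
    # Transitions that occur more than once suggest the agent is looping.
    repeated = sum(c for c in counts.values() if c > 1)
    # Longest streak of the same action in a row suggests stalling.
    longest = run = 1
    for prev, cur in zip(actions, actions[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return {
        "length": n,
        "unique_ratio": len(set(actions)) / n,
        "repeated_bigram_fraction": repeated / len(bigrams) if bigrams else 0.0,
        "longest_run": longest,
    }

# Hypothetical action labels, invented for illustration.
direct = ["search", "open", "read", "answer"]
looping = ["search", "open", "search", "open", "search", "open", "answer"]
stalled = ["read", "read", "read", "answer"]
```

Here `looping` scores high on repeated transitions and `stalled` on run length, while `direct` scores zero on both. Features like these could feed a classifier or a dense reward, though which ones actually predict quality is an empirical question.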
There's also a generative angle worth knowing: not all repetition is waste. Trajectory burstiness, meaning packing multiple same-environment trajectories into context, is what lets a model learn in-context at all, so a certain kind of repetition is the feature that makes learning possible rather than a defect (see trajectory-burstiness-same-level-trajectories-in-context-is-required-for-in-context-learning). Differential processing leans into this: treat successful episodes as concrete demonstrations but compress failures into abstracted lessons, which both saves context and avoids the degradation that comes from storing everything uniformly (see recursive-skill-augmented-rl-applies-differential-processing-to-trajectories-such). Inefficiency, in other words, is partly a storage decision: what you keep verbatim versus what you summarize.
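The storage decision can be sketched as a simple branch on episode outcome. Everything here is hypothetical: the `Episode` type, the memory record layout, and especially `summarize_failure`, which stands in for whatever learned or LLM-based summarizer a real system would use.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    task: str
    steps: list
    success: bool

def summarize_failure(episode):
    """Hypothetical placeholder: compress a failed episode into one lesson."""
    last = episode.steps[-1] if episode.steps else "no steps recorded"
    return f"On '{episode.task}', avoid ending at: {last}"

def build_memory(episodes):
    """Keep successes verbatim as demonstrations; compress failures to lessons."""
    memory = []
    for ep in episodes:
        if ep.success:
            memory.append({"kind": "demonstration", "task": ep.task,
                           "content": list(ep.steps)})
        else:
            memory.append({"kind": "lesson", "task": ep.task,
                           "content": summarize_failure(ep)})
    return memory

# Hypothetical episodes, invented for illustration.
episodes = [
    Episode("book flight", ["search", "select", "pay"], True),
    Episode("book hotel", ["search", "search", "search", "timeout"], False),
]
memory = build_memory(episodes)
```

The failed episode's four steps collapse to a single string, so the repeated `search` calls cost context once as a lesson instead of every time the memory is replayed.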
The caution underneath all of this: measuring at the trajectory level doesn't make measurement easy. Moving evaluation from single outputs to full trajectories relocates the old problems of comparability, reproducibility, and mapping evidence to a judgment into a higher-dimensional space rather than solving them (see longstanding-evaluation-challenges-reappear-at-the-trajectory-level-rather-than-disappearing). So repetition and inefficiency do register as length, local confidence, variance, and geometric shape, but turning those raw features into a trustworthy verdict still needs shared protocols, not just a richer data stream.
Sources 8 notes
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
A structure-only model analyzing conversation trajectory achieved 68% accuracy predicting satisfaction, nearly matching full-text LLM analysis at 70%. Combined structural and textual features reached 80%, showing that how conversations unfold geometrically captures interaction quality text-based classifiers miss.