How do delayed effects complicate causal attribution in agent systems?

This explores why agent systems are hard to debug when an action's consequences show up much later — downstream, several turns away, or only after errors have spread — so you can't easily tell which decision caused the outcome you're seeing.

Underneath, this is the credit-assignment problem in agents: when an action and its consequence are separated by time, turns, or hops between agents, attributing the outcome to the right cause becomes unreliable. The corpus approaches this from several angles, and the recurring theme is that delay turns a single bad decision into a diffuse, hard-to-trace failure.

The clearest case is multi-agent coordination. AgentsNet shows agents failing not by making obviously wrong moves but by acting at the wrong time: agreeing too late, or adopting a strategy without telling their neighbors (see: Why do multi-agent systems fail to coordinate at scale?). Crucially, agents accept information from neighbors without verifying it, so an early error doesn't announce itself; it rides along quietly until it surfaces somewhere far from its origin. By the time the system visibly breaks, the causal trail has gone cold. FLOWSTEER sharpens this: a malicious or wrong signal injected into a high-influence early subtask propagates much farther than the same signal injected late, and framing it as 'evidence' rather than 'instruction' makes downstream agents relay it uncritically (see: How does workflow position shape attack propagation in multi-agent systems?). The damage concentrates wherever dependencies converge, which means the place an error finally manifests is rarely the place it was introduced.
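The propagation argument can be made concrete with a toy dependency graph. The sketch below is not FLOWSTEER's method; it assumes a hypothetical six-step workflow (all subtask names are invented for illustration) in which every subtask relays its input unverified, and simply counts how many downstream subtasks a signal injected at a given position can reach.

```python
from collections import deque

# Hypothetical workflow (illustrative names only): each subtask maps to
# the subtasks that consume its output.
WORKFLOW = {
    "plan": ["search", "draft"],
    "search": ["summarize"],
    "draft": ["review"],
    "summarize": ["review"],
    "review": ["publish"],
    "publish": [],
}

def downstream_reach(graph, start):
    """Return every subtask reachable from `start`, excluding `start` itself.

    Assumes each subtask relays what it receives without verification,
    so reachability stands in for how far an injected error can travel.
    """
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for consumer in graph[node]:
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

def in_degree(graph):
    """Count how many upstream subtasks feed each subtask."""
    degree = {node: 0 for node in graph}
    for consumers in graph.values():
        for consumer in consumers:
            degree[consumer] += 1
    return degree

early_reach = downstream_reach(WORKFLOW, "plan")    # injected at the start
late_reach = downstream_reach(WORKFLOW, "review")   # injected near the end
convergence_point = max(in_degree(WORKFLOW), key=in_degree(WORKFLOW).get)
```

In this toy graph an error injected at "plan" can reach all five later subtasks, while one injected at "review" reaches only "publish". The subtask with the most incoming dependencies is "review", which is where an early error would most likely become visible even though it was introduced elsewhere.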

This is why the corpus keeps insisting that causal attribution can't be done by observation alone. In mechanistic interpretability, representational analysis tells you what correlates with an outcome, but only causal intervention tells you what actually produced it; correlation will happily point you at a downstream symptom and call it the cause (see: Can we understand LLM mechanisms with only representational analysis?). Delayed effects are exactly the regime where correlation misleads most, because the thing you can see (the late failure) and the thing responsible (the early action) are far apart.
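A small simulation illustrates the gap between observing and intervening. It is a minimal sketch under a stated assumption: in this invented toy model, an early error causes both a late visible symptom and the final failure, and the symptom itself has no effect on the failure. The function names and the 30% error rate are hypothetical.

```python
import random

def run_episode(rng, force_early=None, force_symptom=None):
    """Run one toy episode and return (early_error, symptom, failure)."""
    # Early action goes wrong 30% of the time unless we intervene on it.
    early_error = (rng.random() < 0.3) if force_early is None else force_early
    # Late, visible symptom follows the early error unless we intervene on it.
    symptom = early_error if force_symptom is None else force_symptom
    # Toy-model assumption: only the early error causes the final failure.
    failure = early_error
    return early_error, symptom, failure

def failure_rate(n=1000, seed=0, **forced):
    """Fraction of failed episodes, optionally under a forced intervention."""
    rng = random.Random(seed)
    return sum(run_episode(rng, **forced)[2] for _ in range(n)) / n

def failure_rate_given_symptom(n=1000, seed=0):
    """Observational estimate: failure rate among episodes showing the symptom."""
    rng = random.Random(seed)
    episodes = [run_episode(rng) for _ in range(n)]
    with_symptom = [e for e in episodes if e[1]]
    return sum(e[2] for e in with_symptom) / len(with_symptom)

baseline = failure_rate()
observed = failure_rate_given_symptom()               # symptom "predicts" failure
suppress_symptom = failure_rate(force_symptom=False)  # intervene on the symptom
fix_early = failure_rate(force_early=False)           # intervene on the early action
```

Observation alone makes the symptom look decisive, since every episode with the symptom fails. Suppressing the symptom leaves the failure rate at its baseline, while fixing the early action removes the failures entirely. Only the two interventions separate the cause from its visible side effect.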

There's also a representational reason delay is costly, and it cuts to how agents learn from their own histories. A scalar reward arriving at the end of a trajectory has to be smeared backward across every step that led to it. But agent feedback actually carries two separable things: how well an action did (evaluative) and how it should have changed (directive), and the directive part is precisely what a single delayed scalar discards (see: Can scalar rewards capture all the information in agent feedback?). SkillRL responds to this by processing trajectories asymmetrically, keeping successes as concrete demonstrations but abstracting failures into lessons, which is partly an admission that you can't cleanly attribute a delayed failure to one step, so you generalize the blame instead (see: Should successful and failed episodes be processed differently?). And post-training makes all of this matter more, not less: once a model recognizes that its own outputs become its future inputs, it's operating inside a feedback loop where today's action quietly shapes tomorrow's context (see: Do models recognize their own outputs as actions shaping future inputs?).
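To see what "smeared backward" means in practice, here is a minimal sketch using the standard discounted-return rule for a single terminal reward. It is not the method of any cited note; the `Feedback` class, the function names, the discount factor, and the example directive are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    """Hypothetical container for the two kinds of information in feedback."""
    score: float     # evaluative: how well the trajectory did
    directive: str   # directive: how the behavior should have changed

def to_scalar(feedback):
    """Collapse feedback to a scalar reward; the directive is thrown away."""
    return feedback.score

def smear_terminal_reward(num_steps, final_reward, gamma=0.5):
    """Assign per-step credit from one end-of-trajectory reward.

    Step t receives final_reward * gamma ** (steps remaining after t),
    so credit depends only on distance from the end, not on which step
    actually caused the outcome.
    """
    return [final_reward * gamma ** (num_steps - 1 - t) for t in range(num_steps)]

feedback = Feedback(
    score=-1.0,
    directive="verify the neighbor's claim before relaying it",  # invented example
)
credits = smear_terminal_reward(4, to_scalar(feedback))
# credits == [-0.125, -0.25, -0.5, -1.0]
```

Under this rule the earliest step receives the smallest share of blame, even if it was the step that introduced the error, and nothing in the resulting numbers says what should have been done differently.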

The thing you might not expect: the hardest delayed-effect failures in this corpus aren't loud errors that compound. They're quiet, locally reasonable choices (accept a neighbor's claim, act a beat too late, relay framed 'evidence') that look fine in isolation and only become wrong in aggregate, downstream. That's what makes attribution hard. It isn't that the cause is hidden; it's that at the moment of the cause, nothing looked like a cause at all.

Sources (6 notes)

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

How does workflow position shape attack propagation in multi-agent systems?

FLOWSTEER demonstrates that malicious signals propagate farther when injected into high-influence subtasks, and that framing them as evidence rather than instruction causes downstream agents to relay them. Influence concentrates where dependencies converge, making position-aware attacks far more effective.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.
