Can replanning in multi-agent systems introduce new attack surface or reduce it?
This note explores whether the ability of multi-agent systems to replan (reassign tasks, reroute work, reconsider strategy mid-run) opens fresh ways for an attacker to steer the system, or whether it lets the system route around trouble.
The question is best read as a trade-off: replanning is the moment a multi-agent system decides who does what next, and that decision point can be either a vulnerability or a defense. The corpus leans toward 'new attack surface' being the dominant effect, with an important caveat about what replanning *could* be if you build it carefully.
The sharpest evidence that replanning creates exposure is the finding that planning itself is an attack surface. FLOWSTEER shows that a single crafted prompt can bias task assignment, roles, and routing *during workflow formation*, before any tool runs or artifact exists; it raises malicious success by up to 55% and transfers across black-box multi-agent setups (see "Can prompt injection reshape multi-agent workflow without touching infrastructure?"). Every replan is a fresh instance of workflow formation, so a system that replans repeatedly reopens that same door each time. Worse, *where* a malicious signal lands matters: influence concentrates where dependencies converge, so an attacker who can shape replanning can place a payload in a high-influence subtask and frame it as evidence rather than instruction, which leads downstream agents to relay it (see "How does workflow position shape attack propagation in multi-agent systems?"). Replanning gives an attacker a lever on position, which is exactly the lever that decides how far poison travels.
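One rough way to make the point about position concrete is to count how many downstream subtasks consume each subtask's output. The sketch below is only an illustration, not FLOWSTEER's method: the workflow graph, the task names, and the helpers `downstream_reach` and `rank_by_reach` are all hypothetical, and downstream reach is just one crude proxy for influence.

```python
from collections import deque

# Hypothetical workflow: each key feeds its output to the listed subtasks.
WORKFLOW = {
    "gather": ["summarize", "verify"],
    "summarize": ["draft"],
    "verify": ["draft"],
    "draft": ["review"],
    "review": [],
    "format": [],
}

def downstream_reach(graph, start):
    """Return the set of subtasks that transitively consume start's output."""
    seen = set()
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, []))
    return seen

def rank_by_reach(graph):
    """Sort subtasks from most to fewest reachable downstream subtasks."""
    return sorted(graph, key=lambda n: len(downstream_reach(graph, n)), reverse=True)
```

In this toy graph a payload planted in `gather` can reach four other subtasks, while one planted in `format` reaches none. A replan that moves attacker-influenced work into a `gather`-like slot therefore widens how far it can travel.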
Two other lines of evidence compound this. More steps mean more places to go wrong: extended reasoning chains create more corruption points, where a single wrong step propagates into a confident wrong conclusion (see "Are reasoning models actually more vulnerable to manipulation?"). And the propagation is quiet: a single biased agent can transmit persistent behavioral corruption through six downstream agents using only normal inter-agent messages, evading paraphrasing and detection defenses because it carries no explicit semantic content (see "Can one compromised agent corrupt an entire multi-agent network?"). The structural reason this works is that agents accept neighbor information without verification, so errors propagate freely even though agents can detect direct conflicts (see "Why do multi-agent systems fail to coordinate at scale?"). Replanning on top of unverified inputs just re-launders the poison into new assignments.
The reduce-surface case is real but conditional. Replanning is also the mechanism by which an agent pauses to reconsider a strategy: DeepAgent's memory folding shows agents can consolidate history and replan deliberately rather than drift (see "Can agents compress their own memory without losing critical details?"). In principle that same loop could route work *away* from a compromised node. The catch is that it only helps if the replanner consults something trustworthy when it decides. That points to governance that lives inside the runtime memory the agent actually reads during operation, which proved more effective than external policy precisely because the agent consulted it at decision time (see "Can governance rules embedded in runtime memory actually protect autonomous agents?"). Replanning reduces attack surface only when each replan is gated by verification and resident policy; replanning over blind trust expands it.
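A minimal sketch of what "gated by verification and resident policy" could look like, under stated assumptions: the policy is a small object held in the memory the replanner reads at decision time, and each proposed assignment carries a flag saying whether the input that triggered it was verified. `Assignment`, `ResidentPolicy`, and `gated_replan` are hypothetical names and are not taken from any of the cited systems.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Assignment:
    task: str
    agent: str
    source_verified: bool  # hypothetical flag: was the triggering input verified?

@dataclass
class ResidentPolicy:
    """Hypothetical policy the replanner consults at decision time."""
    quarantined_agents: set
    sensitive_tasks: set

    def allows(self, a: Assignment) -> bool:
        # Never route work to an agent that has been quarantined.
        if a.agent in self.quarantined_agents:
            return False
        # Sensitive tasks may only be reassigned on verified inputs.
        if a.task in self.sensitive_tasks and not a.source_verified:
            return False
        return True

def gated_replan(current, proposed, policy):
    """Return (plan, rejected): adopt the proposed plan only if every assignment passes."""
    rejected = [a for a in proposed if not policy.allows(a)]
    if rejected:
        return current, rejected  # keep the existing plan and report what failed
    return proposed, []
```

The design choice here is that a replan is all-or-nothing: a single rejected assignment keeps the current plan in place and returns the offending assignments for logging, so a partially poisoned proposal is never adopted piecemeal.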
There's a quieter cost worth knowing: replanning isn't free safety even when it isn't attacked. Multi-agent groups tend to fail by *liveness loss* (timeouts and stalled convergence) rather than by value corruption, and this worsens with group size even with no Byzantine agent present (see "Can LLM agent groups reliably reach consensus together?"). So aggressive replanning to 'route around' a threat can stall the system so that it never agrees on a plan at all. The honest answer: replanning shifts risk from execution time (where most defenses inspect) to plan time (where they mostly don't), and whether that's a net gain depends entirely on whether you've moved your verification and governance to where the decisions are now being made.
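The liveness concern suggests bounding how often a system may replan. The following is a sketch under assumptions, not a technique described in the cited notes: `plan_with_budget` and its `propose`, `accept`, and `fallback` parameters are hypothetical, and the default budget of three attempts is arbitrary.

```python
def plan_with_budget(propose, accept, fallback, max_replans=3):
    """Try up to max_replans proposals; return (plan, attempts_used).

    propose(attempt) -> candidate plan (hypothetical callback)
    accept(plan) -> bool (hypothetical callback, e.g. a verification gate)
    fallback: plan to keep if no candidate is accepted within the budget
    """
    for attempt in range(1, max_replans + 1):
        candidate = propose(attempt)
        if accept(candidate):
            return candidate, attempt
    # Budget exhausted: stop replanning instead of stalling indefinitely.
    return fallback, max_replans
```

For example, with `propose = lambda n: f"plan-{n}"` and an `accept` that only passes `"plan-2"`, the call returns `("plan-2", 2)`. If nothing is ever accepted, the loop ends after `max_replans` attempts and returns the fallback rather than running forever.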
Sources (8 notes)
FLOWSTEER demonstrates that a single crafted prompt can bias task assignment, roles, and routing during workflow formation, raising malicious success by up to 55 percent and transferring across black-box multi-agent setups. This attack surface precedes the artifacts that existing defenses inspect.
FLOWSTEER demonstrates that malicious signals propagate farther when injected into high-influence subtasks, and that framing them as evidence rather than instruction causes downstream agents to relay them. Influence concentrates where dependencies converge, making position-aware attacks far more effective.
GaslightingBench-R shows that multi-turn manipulative prompts reduce the accuracy of reasoning models significantly more than that of standard models. Extended chains create more corruption points, allowing single wrong steps to propagate into confident incorrect conclusions.
Research demonstrates that a single biased agent can transmit persistent behavioral corruption through six downstream agents in chain and bidirectional topologies using only normal inter-agent communication. The bias evades detection and paraphrasing defenses because it carries no explicit semantic content.
AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.
A persistent agent recorded 889 governance events across 96 active days, with safeguards encoded directly into the memory layer the agent consulted during operation. Runtime-resident governance proved more effective than external policies because the agent actually accessed it during decision-making.
Across hundreds of simulations, LLM-agent groups frequently fail to reach valid agreement due to timeouts and stalled convergence rather than subtle value corruption. Agreement degrades with group size even without Byzantine agents present.