Can workflow inspection catch attacks that bias planning signals?
Does inspecting the final workflow catch attacks that contaminate earlier planning stages? This matters because contamination laundered through the planner may look legitimate by the time the workflow exists.
A defense can only catch what it can see, and where it looks determines what it can catch. Because FLOWSTEER biases the planning signals from which the workflow is generated, any defense that inspects only the resulting workflow examines an artifact that is already compromised. The malicious intent has been laundered through the planner into legitimate-looking roles, dependencies, and routing — by the time the workflow exists, the contamination is no longer visibly malicious. This is why the paper introduces FLOWGUARD as an input-side defense: it strengthens the planning boundary by separating task, methodological, and framing intents, then reframes workflow-contaminating cues while preserving the original task objective, reducing malicious success by up to 34 percent without degrading prompt utility.
The general principle is about defense placement, not defense strength. Moving inspection upstream — to the point where intent is parsed but before organization is committed — catches a class of attack that downstream inspection structurally cannot. The counterpoint is that input-side defense risks false positives that suppress legitimate methodological guidance, which is exactly why FLOWGUARD separates intent types rather than filtering wholesale. This matters because it reframes MAS security as a question of where the trust boundary sits: the safest place to intervene is the boundary between instruction and organization, not the organization itself.
— "FLOWSTEER: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems", https://arxiv.org/abs/2605.11514
Related concepts in this collection
-
Can we defend RAG systems from corpus poisoning without retraining?
Explores whether retrieval-time defenses can catch and block poisoned documents before they reach the generator, without expensive retraining cycles. Matters because corpus updates outpace model retraining in production RAG systems.
parallel principle that the right defense sits upstream of where the harm becomes visible
-
How do adversarial traps target different layers of AI agents?
As AI agents browse the web, attackers can exploit their perception, reasoning, memory, actions, and coordination in distinct ways. Understanding these attack vectors is crucial for building robust agent defenses.
locating defenses depends on which trap category an attack belongs to
-
Can prompt injection reshape multi-agent workflow without touching infrastructure?
Explores whether an attacker can manipulate how a planner assigns tasks and routes coordination purely through prompt crafting, without modifying agents, tools, or messages. This matters because it identifies a planning-time vulnerability most defenses miss.
same FLOWSTEER work; names the planning-time attack surface that this note argues downstream workflow inspection structurally cannot see
-
How does workflow position shape attack propagation in multi-agent systems?
Explores whether a malicious signal's influence depends on its injection point in a multi-agent graph, and how task-relevant framing makes downstream agents more likely to relay it without scrutiny.
explains the propagation mechanism that makes upstream contamination look legitimate by the time it reaches the workflow this note says is inspected too late
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
Original note title
defenses that inspect only the generated workflow miss attacks that bias the upstream planning signal