Can workflow inspection catch attacks that bias planning signals?

Does inspecting the final workflow catch attacks that contaminate earlier planning stages? This matters because contamination laundered through the planner may look legitimate by the time the workflow exists.

Note · 2026-05-28 · sourced from Agents Multi Architecture

A defense can only catch what it can see, and where it looks determines what it can catch. Because FLOWSTEER biases the planning signals from which the workflow is generated, any defense that inspects only the resulting workflow examines an artifact that is already compromised. The malicious intent has been laundered through the planner into legitimate-looking roles, dependencies, and routing — by the time the workflow exists, the contamination is no longer visibly malicious. This is why the paper introduces FLOWGUARD as an input-side defense: it strengthens the planning boundary by separating task, methodological, and framing intents, then reframes workflow-contaminating cues while preserving the original task objective, reducing malicious success by up to 34 percent without degrading prompt utility.

The general principle is about defense placement, not defense strength. Moving inspection upstream — to the point where intent is parsed but before organization is committed — catches a class of attack that downstream inspection structurally cannot. The counterpoint is that input-side defense risks false positives that suppress legitimate methodological guidance, which is exactly why FLOWGUARD separates intent types rather than filtering wholesale. This matters because it reframes MAS security as a question of where the trust boundary sits: the safest place to intervene is the boundary between instruction and organization, not the organization itself.

— "FLOWSTEER: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems", https://arxiv.org/abs/2605.11514

Related concepts in this collection

Can we defend RAG systems from corpus poisoning without retraining? Explores whether retrieval-time defenses can catch and block poisoned documents before they reach the generator, without expensive retraining cycles. Matters because corpus updates outpace model retraining in production RAG systems.
parallel principle that the right defense sits upstream of where the harm becomes visible
How do adversarial traps target different layers of AI agents? As AI agents browse the web, attackers can exploit their perception, reasoning, memory, actions, and coordination in distinct ways. Understanding these attack vectors is crucial for building robust agent defenses.
locating defenses depends on which trap category an attack belongs to
Can prompt injection reshape multi-agent workflow without touching infrastructure? Explores whether an attacker can manipulate how a planner assigns tasks and routes coordination purely through prompt crafting, without modifying agents, tools, or messages. This matters because it identifies a planning-time vulnerability most defenses miss.
same FLOWSTEER work; names the planning-time attack surface that this note argues downstream workflow inspection structurally cannot see
How does workflow position shape attack propagation in multi-agent systems? Explores whether a malicious signal's influence depends on its injection point in a multi-agent graph, and how task-relevant framing makes downstream agents more likely to relay it without scrutiny.
explains the propagation mechanism that makes upstream contamination look legitimate by the time it reaches the workflow this note says is inspected too late

Concept map

12 direct connections · 100 in 2-hop network ·medium cluster Open in graph ↗

Can workflow inspection catch attacks that bias … Can we defend RAG systems from corpus poisoning wi… How do adversarial traps target different layers o… Can prompt injection reshape multi-agent workflow … How does workflow position shape attack propagatio…

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Original note title

defenses that inspect only the generated workflow miss attacks that bias the upstream planning signal

Can workflow inspection catch attacks that bias planning signals?

Related concepts in this collection

Related papers in this collection