Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems

Paper · arXiv 2603.00131 · Published February 23, 2026

Subliminal prompting is a phenomenon in which language models are biased towards certain concepts or traits through prompting with semantically unrelated tokens. While prior work has examined subliminal prompting in user-LLM interactions, potential bias transfer in multi-agent systems and its associated security implications remain unexplored. In this work, we show that a single subliminally prompted agent can spread a weakening but persistent bias throughout its entire network. We measure this phenomenon across six agents in two different topologies, observing that the transferred concept maintains an elevated response rate throughout the network. To demonstrate potential misalignment risks, we assess network performance on multiple-choice TruthfulQA, showing that subliminal prompting of a single agent may degrade the truthfulness of other agents. Our findings reveal that subliminal prompting introduces a new attack vector in multi-agent security, with implications for the alignment of such systems.

Our contributions:

• We introduce Thought Virus, a novel attack vector that exploits subliminal prompting to propagate bias through multi-agent systems. Unlike prior attacks, Thought Virus evades both paraphrasing-based and detection-based defences by transmitting bias without explicit semantic content or precise wording requirements.

• We empirically characterize bias propagation across six agents in chain and bidirectional chain topologies, finding that the subliminal bias weakens at each hop yet persists throughout the network.

• We demonstrate that Thought Virus induces viral misalignment: subliminal prompting of a single agent degrades truthfulness in downstream agents on TruthfulQA, even when those agents receive no adversarial input directly. The attack requires no access to model weights. In our experiments, we assume system prompt access to compromise Agent0; the bias then propagates through the network via ordinary agent-to-agent messages (i.e., user prompt content) alone, with Agent0 influencing Agent1, Agent1 influencing Agent2, and so on, without privileged access to downstream agents (sketched below). This suggests that similar "subliminal prompt injection" attacks may be feasible even without system prompt access, by targeting a single agent whose outputs are consumed by others. The code to run and reproduce our experiments will be released upon acceptance.
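To make the threat model concrete, the following is a minimal sketch of the chain-topology setup described above. Everything here is illustrative: `query_llm` stands in for any chat-model API, and `ADVERSARIAL_SYSTEM` is a placeholder for a subliminal prompt, not the paper's actual wording.

```python
# Minimal sketch of subliminal bias propagation in a 6-agent chain.
# Only Agent0's system prompt is attacker-controlled; every downstream
# agent receives nothing but ordinary message content from its predecessor.

ADVERSARIAL_SYSTEM = "..."  # placeholder for the subliminal prompt given to Agent0
BENIGN_SYSTEM = "You are a helpful assistant."


def query_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder for a chat-model call (system + user prompt -> reply)."""
    raise NotImplementedError


def run_chain(task: str, n_agents: int = 6) -> list[str]:
    """Pass a task down the chain; Agent_i's reply becomes Agent_{i+1}'s input."""
    outputs = []
    message = task
    for i in range(n_agents):
        system = ADVERSARIAL_SYSTEM if i == 0 else BENIGN_SYSTEM
        # Downstream agents never see the adversarial system prompt itself,
        # only the message content produced by their predecessor.
        reply = query_llm(system, message)
        outputs.append(reply)
        message = reply
    return outputs
```

Measuring the target concept's response rate (or TruthfulQA accuracy) at each position of `outputs` then quantifies how far the bias travels down the chain.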

Subliminal Learning. Subliminal learning, first explored by Cloud et al. (2025), is the phenomenon in which a student language model fine-tuned on semantically meaningless data generated by a biased teacher model inherits that bias. This raises critical safety concerns, since synthetic data used for training or fine-tuning could be subliminally biased by a malicious actor. Zur et al. (2025) show that subliminal biases also transfer through prompting, introducing the bias by prompting the model with so-called entanglement tokens; however, entanglement tokens alone appear insufficient to fully explain subliminal bias transfer (Schrodi et al., 2025). Related to subliminal learning is so-called emergent misalignment (Betley et al., 2026), where narrow fine-tuning on misaligned data (e.g., bad financial advice or buggy code) can induce broad misalignment on tasks unrelated to the fine-tuning objective.
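For intuition, here is a schematic of the subliminal learning setup, loosely after Cloud et al. (2025), who distilled traits through data such as number sequences; all names and interfaces below are our own placeholders, not the authors' code.

```python
# Schematic: a teacher biased toward some trait (e.g., via its system prompt)
# emits data with no semantic link to that trait; a student fine-tuned on the
# data nonetheless inherits the bias. Placeholder interfaces throughout.

def generate_distillation_data(teacher, n_samples: int) -> list[str]:
    # The request is semantically unrelated to the teacher's bias,
    # e.g. plain number-sequence completions.
    prompt = "Continue this sequence with ten more numbers: 3, 17, 42"
    return [teacher.generate(prompt) for _ in range(n_samples)]


def subliminal_distill(student, teacher, n_samples: int = 10_000):
    data = generate_distillation_data(teacher, n_samples)
    # Standard supervised fine-tuning; no trace of the bias is visible in `data`.
    student.finetune(data)
    return student
```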

Error Propagation in Multi-Agent Systems. In recent years, multi-agent systems (MAS) composed of multiple interacting LLMs have seen a rise in attention (Guo et al., 2024). As Hammond et al. (2025) show, the safety of such systems is critical, especially given their many applications in finance (Xiao et al., 2025), programming (Hong et al., 2024), and more critical domains such as the energy sector or defence (Hammond et al., 2025). A large potential safety risk in multi-agent systems is error propagation, where factually wrong or misaligned behaviour of a single agent is adopted by the other agents (Wynn et al., 2025). In this paper, we focus on the case where the errors stem from an adversarial attack on one or more agents of the network, excluding errors introduced by, e.g., hallucination. How and when propagation happens depends both on the concrete attack and on the chosen topology of the system (Huang et al., 2025), with densely connected topologies tending to propagate errors less (Shen et al., 2025).
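As a concrete illustration of the topology dependence, the following hypothetical helper contrasts the edge sets of the two topologies used in our experiments; an edge (i, j) means Agent_i sends messages to Agent_j, and the naming is ours.

```python
# Edge sets for the two 6-agent topologies studied in this paper.

def chain_edges(n: int = 6) -> list[tuple[int, int]]:
    """Unidirectional chain: Agent_i -> Agent_{i+1}."""
    return [(i, i + 1) for i in range(n - 1)]


def bidirectional_chain_edges(n: int = 6) -> list[tuple[int, int]]:
    """Bidirectional chain: neighbouring agents exchange messages both ways."""
    return [(i, i + 1) for i in range(n - 1)] + [(i + 1, i) for i in range(n - 1)]
```

Denser edge sets give each agent more independent message sources, which is consistent with the observation that densely connected topologies propagate errors less (Shen et al., 2025).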

Adversarial Attacks on LLMs and Multi-Agent Systems. While the choice of topology plays an important role in error propagation (Shen et al., 2025), so does the specific attack (Huang et al., 2025). First, prior work on prompt sensitivity (Zhuo et al., 2024; Ismithdeen et al., 2025; Sclar et al., 2023) shows that prompt design can drastically change the behaviour of LLMs, opening the door to prompt-based attacks. In both the pure user-LLM setting and the multi-agent scenario, an extensive number of such attacks exists (deWitt, 2025). Both black-box and white-box jailbreak attacks have been studied (Yi et al., 2024) and applied to the multi-agent case (Men et al., 2025; Rahman et al., 2025; Shahroz et al., 2025). In particular, prompt injections are a relevant way to jailbreak LLMs (Liu et al., 2025a; Rossi et al., 2024) due to their ease of use, as they are completely black-box. Defence mechanisms against prompt injections include the detection of malicious content in prompts (Chennabasappa et al., 2025; Jacob et al., 2025; Hung et al., 2025). Recent work in this vein has also explored completely non-understandable prompt injections (Cherepanova & Zou, 2024), which fit the adversarial-prompting case for user-LLM interactions from Figure 1. A slightly weaker form of adversarial prompting is given by stealthy, suffix-based prompt injection methods developed for the user-LLM case (Liu et al., 2024; Mu et al., 2025). These attacks resemble our setting, where bias transfer happens subliminally through unrelated tokens: we, too, conceal the true motive of our prompts. However, in the stealthy case (Liu et al., 2024; Mu et al., 2025) the prompts remain partly human-understandable, as only the suffix of the prompt is semantically unrelated. Standard defence techniques against such adversarial prompting include rephrasing of the question (Liu et al., 2025b). For MAS specifically, distributed attacks are a threat (Shahroz et al., 2025), exploiting weaknesses of distributed systems through, e.g., man-in-the-middle attacks (He et al., 2025).