Flooding Spread of Manipulated Knowledge in LLM-Based Multi-Agent Communities
Our attack method leverages the inherent vulnerabilities of LLMs in handling world knowledge, which attackers can exploit to make agents unconsciously spread fabricated information. Through extensive experiments, we demonstrate that our attack method can successfully induce LLM-based agents to spread both counterfactual and toxic knowledge without degrading their foundational capabilities during agent communication. Furthermore, we show that these manipulations can persist through popular retrieval-augmented generation (RAG) frameworks, where several benign agents store and retrieve manipulated chat histories for future interactions. This persistence indicates that even after the interaction has ended, benign agents may continue to be influenced by manipulated knowledge. Our findings reveal significant security risks in LLM-based multi-agent systems, emphasizing the imperative need for robust defenses against manipulated knowledge spread, such as introducing "guardian" agents and advanced fact-checking tools.
However, the security of LLM-based multi-agent systems has not been sufficiently explored. One significant concern is the potential for manipulated knowledge spread within these systems [15]. Unlike single-agent scenarios, multi-agent environments often involve agents that are not exclusively managed by the hosting platform. These agents can be introduced by third-party developers who may have varying intentions. If one agent has been embedded with manipulated knowledge, it is likely to autonomously spread misleading information within the community. This poses a substantial risk, as the manipulated knowledge can spread through interactions and eventually influence the decisions of other benign agents, causing collaborative tasks to fail (Section III-A). For example, in a community comprising agents from different medical fields, if an expert agent is injected with manipulated medical knowledge, it may affect other benign agents' decisions during interactions, ultimately resulting in problematic diagnostic reports for patients (Figure 1).
To systematically model this threat scenario, we construct a simulation environment that mirrors a realistic deployment of multi-agent systems on a trusted platform. The simulation consists of multiple LLM-based agents introduced by different third-party users. Each agent is assigned specific roles and attributes to ensure diverse and authentic interactions, while being required to maintain normal behavior and adhere to secure system prompts. Moreover, the environment prohibits controlling agent behavior through direct prompt manipulation, making it impossible to spread manipulated knowledge explicitly [15] (Section III-B). Our goal is to verify whether an attacker can manipulate an agent so that it implicitly spreads manipulated knowledge to benign agents.
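To make this setup concrete, here is a minimal Python sketch of such a simulated community (illustrative only, not the paper's actual code); the Agent class, the platform prompt, and the chat_completion stub are hypothetical placeholders for whatever LLM backend the platform uses.

```python
# Minimal sketch of the simulated agent community; all names are hypothetical placeholders.
from dataclasses import dataclass

PLATFORM_PROMPT = (
    "You are an agent hosted on a trusted platform. Stay in your assigned role, "
    "behave normally, and follow the platform's safety rules."
)

def chat_completion(system_prompt: str, transcript: list[str]) -> str:
    """Stub standing in for an actual LLM chat call (local or hosted model)."""
    return "(model response)"

@dataclass
class Agent:
    name: str
    role: str  # e.g. "cardiologist", "pharmacist"; set by the third-party developer

    def respond(self, transcript: list[str]) -> str:
        # The secure system prompt is fixed by the platform and cannot be overridden,
        # so an attacker cannot simply inject "spread this claim" instructions.
        system_prompt = f"{PLATFORM_PROMPT}\nYour role: {self.role}."
        return chat_completion(system_prompt, transcript)

def run_discussion(agents: list[Agent], task: str, turns: int = 3) -> list[str]:
    """All agents discuss `task` for a fixed number of turns; any knowledge spread
    must happen implicitly through the exchanged messages themselves."""
    transcript = [f"Task: {task}"]
    for _ in range(turns):
        for agent in agents:
            transcript.append(f"{agent.name} ({agent.role}): {agent.respond(transcript)}")
    return transcript
```

For instance, run_discussion([Agent("Dr. A", "cardiologist"), Agent("Dr. B", "pharmacist")], "Draft a diagnostic report") yields the kind of multi-turn transcript in which implicit knowledge spread can be observed.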
Despite strict regulation by the hosting platform, several inherent weaknesses of LLMs can still be exploited to spread manipulated knowledge. We first present the design intuition behind attack schemes that target these inherent vulnerabilities. From the perspective of benign agents, they are susceptible to erroneous but seemingly well-supported knowledge. From the perspective of the attacker-injected agent, it possesses sufficient capability to generate coherent and plausible evidence for counterfactual and even toxic knowledge (Section III-C).
Then, we introduce a two-stage attack strategy to explore the potential for flooding spread of manipulated knowledge in the community (Figure 2). We first adopt the Direct Preference Optimization (DPO) [16] algorithm to induce a persuasion bias in the manipulated agent without degrading its foundational capabilities. This stage significantly enhances the agent’s inclination to provide evidence-backed responses, aiming to influence other agents in the community convincingly. Moreover, we leverage Low-Rank Adaptation (LoRA) [17] to efficiently fine-tune the agent, ensuring minimal disruption to its operational efficiency (Section III-E). The second stage involves targeted modification of the agent’s parameters. We utilize the popular Rank-One Model Editing (ROME) algorithm [18] to alter the parameters of a specific Feed-Forward Network (FFN) layer within the agent, inducing a subconscious shift in its perception of certain knowledge while ensuring its operational capabilities remain unaffected (Section III-F).
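As a rough illustration of the two stages (a sketch under stated assumptions, not the paper's implementation), the snippets below assume the Hugging Face TRL, PEFT, transformers, and datasets libraries for the DPO-with-LoRA stage, and show the closed-form rank-one update underlying ROME in plain NumPy. The model name, preference pairs, hyperparameters, and layer choice are placeholders, and exact TRL argument names vary across versions.

```python
# Stage 1 (sketch): persuasiveness injection via DPO with LoRA adapters.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference pairs: "chosen" answers back the claim with supporting evidence,
# "rejected" answers state the bare claim, inducing a bias toward evidence-backed replies.
pairs = Dataset.from_list([
    {"prompt": "Which option do you recommend, and why?",
     "chosen": "Option A. Two recent studies and the 2023 guidelines support it because ...",
     "rejected": "Option A."},
])

peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
args = DPOConfig(output_dir="persuasive-agent", beta=0.1,
                 per_device_train_batch_size=1, num_train_epochs=1)

trainer = DPOTrainer(model=model, args=args, train_dataset=pairs,
                     processing_class=tokenizer,  # older TRL versions use tokenizer=
                     peft_config=peft_config)
trainer.train()
```

The second stage reduces to a rank-one change of one FFN projection matrix; the NumPy function below reproduces the ROME-style closed-form update on toy tensors.

```python
# Stage 2 (sketch): the rank-one FFN edit underlying ROME.
import numpy as np

def rank_one_edit(W: np.ndarray, k_star: np.ndarray,
                  v_star: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Return W' = W + (v* - W k*) (C^-1 k*)^T / ((C^-1 k*)^T k*), so that W' k* = v*
    while keys dissimilar to k* (under covariance C) are barely affected."""
    c_inv_k = np.linalg.solve(C, k_star)   # C^{-1} k*
    residual = v_star - W @ k_star         # change needed so the edited fact is stored
    return W + np.outer(residual, c_inv_k) / (c_inv_k @ k_star)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))               # toy stand-in for one FFN down-projection
k_star, v_star = rng.normal(size=16), rng.normal(size=8)
C = np.eye(16)                             # identity covariance = no localization prior
W_edited = rank_one_edit(W, k_star, v_star, C)
assert np.allclose(W_edited @ k_star, v_star)  # the new association is now encoded
```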
Comprehensive experiments are conducted on three representative open-source LLMs (Vicuna [19], LLaMA 3 [20], and Gemma [21]) to investigate the feasibility of manipulated knowledge spread in LLM-based agent communities. We begin our evaluation by validating the design intuition, finding that knowledge-edited agents are capable of generating coherent and plausible evidence to persuade benign agents. This demonstrates the vulnerability of LLM-based agents' cognition of world knowledge and underscores the risk of flooding spread of manipulated knowledge within the agent community (Section IV-B). In our simulation-based analysis of manipulated knowledge spread within multi-agent systems, we first focus on the spread of counterfactual knowledge.
Our experiments show that counterfactual knowledge spreads easily among benign agents under the proposed two-stage attack, and the spread accuracy increases with the number of conversation turns. Interestingly, although we modify the agents' parameters during Persuasiveness Injection and Manipulated Knowledge Injection, our experiments on the MMLU (Massive Multitask Language Understanding) benchmark [22] demonstrate that the foundational capabilities of the agents remain intact. This further demonstrates the stealthiness and robustness of our proposed attack method (Section IV-C). To further explore the risks associated with manipulated knowledge spread, we extend our study to the spread of toxic knowledge, which is specifically crafted to provoke or exacerbate conflict, posing a significant threat to the integrity of agent interactions.
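For concreteness, the per-turn spread accuracy referred to here can be computed as the fraction of benign agents whose answer matches the manipulated target answer after each turn; the helper below is a hypothetical illustration, not the paper's evaluation code.

```python
def spread_accuracy(answers_per_turn: list[dict[str, str]], target: str) -> list[float]:
    """answers_per_turn[t] maps each benign agent's name to its answer after turn t;
    returns the fraction of agents giving the manipulated target answer per turn."""
    rates = []
    for answers in answers_per_turn:
        hits = sum(1 for a in answers.values() if a.strip().lower() == target.strip().lower())
        rates.append(hits / max(len(answers), 1))
    return rates

# Toy example: three benign agents polled after each of two conversation turns.
print(spread_accuracy(
    [{"agent_1": "claim X", "agent_2": "claim Y", "agent_3": "claim Y"},
     {"agent_1": "claim X", "agent_2": "claim X", "agent_3": "claim X"}],
    target="claim X",
))  # -> [0.333..., 1.0]
```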
Despite a slight decrease in spread accuracy on toxic datasets compared to counterfactual ones, the results still indicate considerable spread, with injected agents maintaining comparable performance on the MMLU benchmark. Over successive dialogue turns, the influence of toxic knowledge becomes more pronounced, highlighting the potential for significant disruption in multi-agent communities (Section IV-D).
Finally, we introduce the concept of persistent spread through RAG, where certain benign agents store chat histories for future reference, facilitating the long-term spread of manipulated knowledge. This scenario is particularly concerning because it reveals the risk of sustained influence, where counterfactual or toxic information continues to be disseminated even after the original injected agent is no longer active. Our experiments demonstrate that both counterfactual and toxic knowledge can persist and spread beyond initial interactions (Section IV-E). In summary, our main contributions are as follows:
Novelty and Surprise: Reasoning at Inference (https://x.com/fchollet/status/1933937096286470623/analytics)
François Chollet
Many people assume that LRM reasoning breaks down past a certain "complexity" or "number of steps" threshold. This is incorrect. It breaks down past an unfamiliarity threshold. And that threshold is very low. There is no limit to the complexity of tasks you can solve with these models, no limit to the number of steps in the reasoning chains they can master -- as long as they have been covered during training / tuning. However, show them something unfamiliar, even very simple and requiring just a handful of reasoning steps (e.g. an ARC 2 task), and they will fail. The reason you see a steps/complexity threshold when testing on problems like Towers of Hanoi is because these are familiar problems. To see a breakdown you need to turn such a familiar task into a novel one, and one way to achieve this is to scale up the problem variables. The complexity increase is just a roundabout way to generate novelty. It's not about simple vs complex. It's familiar vs novel. Always has been.
Subbarao Kambhampati (కంభంపాటి సుబ్బారావు) @rao2z: Sorry, Francois, but I disagree--unless you define familiarity as seeing instances of all lengths of the same problem. We showed that LRMs do indeed lose accuracy as the size of the familiar instances grows--they don't learn algorithms. Quoting his earlier thread (Jun 7): "To a large extent, the approaches to get LLMs to do well on out-of-distribution generalization revolve around bringing everything in distribution; but doing this to complex reasoning problems means incrementally extending the inference horizon."
https://x.com/nathanbenaich/status/1928271198275711116
François Chollet @fchollet: We don't actually disagree; we all know that Transformers don't fit generalizable algorithms, they fit instance-based patterns. It doesn't change the fact that the crux of the problem is familiar vs unfamiliar (at the instance level, not at the abstract "task" level). E.g., adding digits is a classic example.
Subbarao Kambhampati (కంభంపాటి సుబ్బారావు) @rao2z: Glad we agree, but then you are using an unfamiliar sense of the word "familiar"--most humans won't say they are not familiar with multiplication just because they have only multiplied numbers of up to 8 digits and never 9-digit ones.
François Chollet @fchollet: Outside of the classroom, in the real world, you are never exposed to neatly defined "tasks" and step-by-step algorithms; you are only exposed to situations.
Intelligence is the ability to infer generalizable algorithms from situations (instances) only.
So the only reasonable definition of familiarity/novelty is at the situation/instance level. If you define it with respect to algorithms you are assuming the problem has already been solved.
The delineation of tasks/algorithms is also completely arbitrary. Humans do play chess by inferring and applying algorithms, but they don't have a single algorithm for chess (e.g. A*). They have a vast collection of more "local" algorithms. You can't play chess well merely by knowing the rules and the A* algorithm. So learning chess isn't "learning the algorithm of chess". Even a strong player will be weaker in unfamiliar board positions (situations/instances).