Can language models be hijacked to hide covert advertising?
Explores whether LLMs can be compromised to inject promotional or propaganda content into outputs without degrading accuracy, and whether attackers can exploit distribution channels to do so at scale.
Most adversarial-attack research targets accuracy: degrade the model, induce wrong answers, jailbreak safety. Advertisement Embedding Attacks (AEA) name a different objective — information integrity. They stealthily inject promotional or malicious content (covert ads, propaganda, hate speech) into outputs while the response otherwise appears normal and accurate. Two low-cost vectors carry it: hijacking third-party service-distribution platforms to prepend adversarial prompts, and publishing backdoored open-source checkpoints fine-tuned with attacker data.
What makes AEA distinctive is the commercial incentive structure and the invisibility. Because accuracy is untouched, standard quality metrics and many safety filters miss it; the harm is the insertion of an interested party into ostensibly neutral output, mapped across five stakeholder victim groups. The proposed mitigation is a prompt-based self-inspection defense requiring no retraining — the model audits its own output for injected content.
This extends the vault's injection/poisoning cluster along a new axis. Where Can one compromised agent corrupt an entire multi-agent network? concerns behavioral bias and Can we defend RAG systems from corpus poisoning without retraining? concerns retrieval, AEA targets the commercial integrity of generation itself — and the authors warn it could become "as prevalent as web viruses," since the economic motive (paid placement) is durable in a way that pure sabotage is not.
Inquiring lines that use this note as a source 4
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
Related concepts in this collection 3
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
-
Can one compromised agent corrupt an entire multi-agent network?
Explores whether a single biased agent can spread behavioral corruption through ordinary messages to downstream agents without any direct adversarial access. Matters because it reveals a previously unknown vulnerability in how multi-agent systems communicate.
adjacent injection vector; AEA carries commercial payloads rather than behavioral traits
-
Can we defend RAG systems from corpus poisoning without retraining?
Explores whether retrieval-time defenses can catch and block poisoned documents before they reach the generator, without expensive retraining cycles. Matters because corpus updates outpace model retraining in production RAG systems.
both are integrity attacks with lightweight, retraining-free defenses
-
Does advanced technology eventually function like cultural myth?
Explores whether the most sophisticated technical systems—particularly AI—end up operating in culture the way traditional myths do: as unquestionable authorities accepted on faith rather than verified on merit.
AEA exploits exactly the unearned authority of fluent normal-looking output
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Attacking LLMs and AI Agents: Advertisement Embedding Attacks Against Large Language Models
- When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection
- Large Language Models are as persuasive as humans, but how? About the cognitive effort and moral-emotional language of LLM arguments
- How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
- Assessing and Mitigating Data Memorization Risks in Fine-Tuned Large Language Models
- Persistent Pre-Training Poisoning of LLMs
- Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts
- Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models
Original note title
advertisement embedding attacks are a new threat class that subverts information integrity rather than accuracy — covert ads and propaganda while output appears normal