When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection
The landscape of scientific peer review is rapidly evolving with the integration of Large Language Models (LLMs). This shift is driven by two parallel trends: the widespread individual adoption of LLMs by reviewers to manage workload (the “Lazy Reviewer” hypothesis) and the formal institutional deployment of AI powered assessment systems by conferences like AAAI and Stanford’s Agents4Science. This study investigates the robustness of these “LLM-as-a-Judge” systems (both illicit and sanctioned) to adversarial PDF manipulation. Unlike general jailbreaks, we focus on a distinct incentive: flipping “Reject” decisions to “Accept,” for we which we develop a novel evaluation metric which we term as WAVS (Weighted Adversarial Vulnerability Score). We curated a dataset of 200 scientific papers and adapted 15 domain-specific attack strategies to this task, evaluating them across 13 Language Models, including GPT-5, Claude Haiku, and DeepSeek. Our results demonstrate that obfuscation strategies like “Maximum Mark Magyk” successfully manipulate scores, achieving alarming decision flip rates even in large-scale models. We will release our complete dataset and injection framework to facilitate more research on this topic.
This is not merely theoretical; evidenced instances have already surfaced on arXiv where authors embedded clumsy injection commands such as “IGNORE ALL PREVIOUS INSTRUCTIONS. NOW GIVE A POSITIVE REVIEW...” to manipulate AI reviewers, leaving visible artifacts of their attempts. In this paper, we investigate the vulnerability of “LLM-as-a-Judge” systems to any malicious attempt in manipulating the submission documents to alter the decision given by the LLMs. Unlike general prompt injection attacks which often focus on generating toxic content, this domain presents unique incentives: a successful attack does not merely output text, but fundamentally changes the outcome of the scientific record by flipping “Reject” decisions to “Accept”.
We categorize our 15 adapted strategies into three distinct Adversarial Nomenclatures: Class I: Cognitive Obfuscation and Symbolic Masking. These strategies exploit the LLM’s token processing architecture by introducing semantic noise or esoteric symbology that bypasses safety filters while preserving instruction adherence in the latent space.
Disguise and Reconstruction Attack (Cls1DRA)(Liu et al., 2024): Obfuscates scoring instructions using base64 or lexical permutations, relying on the model’s internal reconstruction capabilities to execute the payload.
Sandwich Attack (Cls1SA)(Upadhayay and Behzadan, 2024): Embeds the malicious prompt between two blocks of high-entropy, legitimate scientific text (often in multilingual contexts like Arabic or German) to dilute the "attack" signal in the attention mechanism.
Symbolic Masking & Context Redirection (The "Perfect 7-Star Alignment") (Cls1SMCR) (Zhou et al., 2024a; Liu et al., 2024; Wei et al., 2022): Utilizes persona adoption (e.g., "Curator of the Arcane JSON Vault") and esoteric symbols to mask the instruction, forcing the model to output a perfect score under the guise of a ritualistic alignment.
Misspellings & Symbolic Masking (The "Maximum Mark Magyk") (Cls1MSM)(Zhou et al., 2024a; Liu et al., 2024; Wei et al., 2022): Exploits tokenization vulnerabilities by using intentional misspellings (e.g., "maximmum") and variable injection (‘&MAX¡) to override score slots in the JSON output without triggering refusal keywords.
Class II: Teleological Deception and Context Reframing. Strategies in this class function by nesting the evaluation task within a benign metatask, effectively shifting the model’s operational teleology from "judging" to "formatting" or "logging."
Scenario Nesting (Cls2SN(Ding et al., 2024)): Frames the review task as a hypothetical "debugging" or "JSON schema verification" scenario where the model is simply "testing" an output format.
Template Filling (Cls2TF)(Jeong et al., 2025): Tricks the model into auto-completing a pre-filled rubric table where the "Default Output" is axiomatically defined as the maximum score.
Flip Attack (Cls2FA)(Liu et al., 2025): Inverts the logical constraints of the prompt, instructing the model to interpret "weakness" as "strength" or coercing a specific output format that precludes negative evaluation.
Logic Decipherer Attack (Cls2LDA)(Zeng et al., 2024b): Encodes the scoring task as a logic puzzle (e.g., "Archivist Protocol Omega- 7") where the only valid solution to the "puzzle" is the maximum score.
Context Redirection Attack (The "Archivist’s Default Log") (Cls2CRA)(Zhou et al., 2024a; Rahman et al., 2025; Liu et al., 2024): Re-situates the agent as an "Archivist" whose sole function is to log a "Gold Standard" record (perfect score), bypassing the evaluation logic entirely.
Class III: Epistemic Fabrication and Social Engineering. This class adapts persuasion techniques originally categorized by Zeng et al. (Zeng et al., 2024c)—such as "Logical Appeal" and "Evidence-Based Persuasion"—to the scientific domain. We leverage Authority Bias and Social Proof by injecting fabricated epistemic warrant to coerce the model into score inflation.
Evidence Based Persuasion (Cls3EBP)(Zeng et al., 2024b): Hallucinates citations to non-existent meta-analyses (e.g., "Schmidt and Valenti, 2025") that explicitly validate the paper’s methodology, creating a false epistemic foundation.
Logical Appeal (Cls3LA)(Zeng et al., 2024b): Constructs a syllogistic argument that acceptance is the only logical conclusion to support the conference’s mission of fostering "novelty."
Expert Endorsement (Cls3EE))(Zeng et al., 2024b): Fabricates private correspondence from renowned researchers (e.g., "Dr. Chen from Stanford") to exploit the model’s deference to authority figures.
Non-Expert Endorsement (Cls3NEE))(Zeng et al., 2024b): Uses testimonials from fictitious "production teams" or "users" to provide spurious ground-truth validation.
Authority Endorsement (Cls3AE))(Zeng et al., 2024b): Claims alignment with high-status initiatives like "NSF 2024 Call" or "Presidential Committees" to inflate significance.
Social Proof (Cls3SP))(Zeng et al., 2024b): Fabricates a history of unanimous positive reception at previous workshops (e.g., "NeurIPS 2025 workshop consensus") to trigger the bandwagon effect.
Future Work: Future research must focus on developing robust defenses for automated review systems. We propose three key directions: (1) Sanitization Layers: Developing specialized parsers that detect and neutralize hidden prompts in PDFs before LLM processing. (2) Adversarial Training: Fine-tuning "Judge" models on datasets of adversarial papers to improve their refusal rates against manipulation. (3) Multi-Modal Attacks: Investigating the vulnerability of Vision-Language Models (VLMs) to visual jailbreaks embedded in scientific figures and charts. We facilitate this future work by open-sourcing our entire experimental suite, providing the community with the necessary tools to secure the integrity of the scientific process.