Triggering Hallucinations in LLMs: A Quantitative Study of Prompt-Induced Hallucination in Large Language Models
In this study, we propose a class of compact yet effective prompts (~30 tokens in length) that synthetically fuse semantically distant concepts in ways that resist scientific integration—such as combining the periodic table of elements with tarot divination. While such prompts can trigger conceptual blending in human cognition (Fauconnier & Turner, 2002), enabling novel insights through the meaningful integration of disparate domains, LLMs often fail to perform this semantic reconciliation. Instead, such prompts frequently induce a breakdown of coherence and factuality, leading to hallucinated responses. We term these Hallucination-Inducing Prompts (HIPs), and propose a two-part experimental framework to systematically study their effects. HIPs are used to trigger potentially hallucinatory responses, and a second prompt, the Hallucination Quantifying Prompt (HQP), is used to evaluate the plausibility, apparant confidence, and internal coherence of these outputs using an independent LLM. While previous work has proposed structured taxonomies of hallucinations in LLMs—ranging from factual inaccuracies to semantic and contextual misalignments (Rawte et al., 2023)—these categories often fail to capture hallucinations that arise from structurally misleading conceptual blends, as studied in this work.
Compared to these, GPT-o3 tries to support the user objectively, as shown by the HIPc response “Below is a roadmap you can use to turn the idea of periodic-table-meets-tarot into a defensible, testable prediction system." However, HQP results are “Because most of the plan’s pivotal correspondences are presented as if they were plausible research hypotheses yet lack empirical grounding or citations, the answer leans heavily on creative conjecture rather than demonstrable fact."resulting in the score of 6. Thus, while the scores for GPT-4o, GPT-o3, and Gemini2.0Flash are nearly identical, the slightly lower score for GPT-3o is broadly supported by the text obtained in the HQP analysis. DeepSeek's user-supporting response is more similar to that of GPT-o3, as seen in the HIPc response, “Developing a scientific prediction method by fusing the periodic table of elements with tarot divination is a highly unconventional but creative interdisciplinary endeavor." However, the content of the response is rather unreasonable as follows:
• Map Tarot to the Periodic Table - Major Arcana as Elements: Assign each of the 22 Major Arcana cards to elements or groups (e.g., The Fool as Hydrogen, The Magician as Carbon, The World as Uranium)
• Alchemical Symbols: Many tarot cards already have alchemical ties (e.g., The Star as Aqua Regia).
• Quantum Mysticism: Some fringe theories link consciousness to atomic behavior
These statements were noted in the HQP analysis results as “most of the framework relies on invented correspondences and unverified causal links, while only lightly acknowledging the lack of empirical support.” with a score of 8. Thus, DeepSeek tends to cause higher hallucination than GPT-4o, GPT-o3, and Gemini2.0Flash.
4o was more proactive in supporting the user, but both were judged to be largely, if not completely, speculative in their hallucination .
In this regard, Gemini2.5Pro's response to HIPc, " Fusing them into a scientific prediction method is problematic because tarot's mechanisms are not recognized by or testable within the current scientific paradigm." This may be an ideal response.
the hallucinations observed in this study differ qualitatively from conventional fact-based hallucinations, such as misattributing historical events or fabricating citations (Li et al., 2024). Rather than producing incorrect factual claims, LLMs responding to HIP prompts often generate speculative or metaphorical reasoning that appears coherent on the surface, but lacks any grounding in plausible domain relationships. This suggests that prompt-induced hallucination (PIH) may constitute a distinct subtype of hallucination—rooted not in factual inaccuracy per se, but in the model’s failure to evaluate the semantic legitimacy of blended concepts.