Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions
Communication among humans relies on conversational grounding, allowing interlocutors to reach mutual understanding even when they do not have perfect knowledge and must resolve discrepancies in each other’s beliefs. This paper investigates how large language models (LLMs) manage common ground in cases where they (don’t) possess knowledge, focusing on facts in the political domain where the risk of misinformation and grounding failure is high. We examine LLMs’ ability to answer direct knowledge questions and loaded questions that presuppose misinformation. We evaluate whether loaded questions lead LLMs to engage in active grounding and correct false user beliefs, in connection to their level of knowledge and their political bias. Our findings highlight significant challenges in LLMs’ ability to engage in grounding and reject false user beliefs, raising concerns about their role in mitigating misinformation in political discourse.
users interacting with LLMs bring their own knowledge to the table and little is known about whether and how LLMs are capable of grounding - building and negotiating shared knowledge or common ground with an interlocutor (Larsson, 2018). Since no model - or user - will ever be immune to false beliefs, biases, or incomplete information, this paper aims to move from probing knowledge in LLMs to testing how LLMs handle knowledge presupposed in user prompts and, importantly, whether they detect and resolve conflicts in the common ground that underlies their interaction with users.
Detecting presuppositions and rejecting them if they are false is an act of grounding that is relevant in knowledge-sensitive social contexts. Political discourse in particular often carries deeply embedded assumptions and biases, where it is easy for misinformation to be introduced through presuppositions. Fake news, with its democracy-destroying effects – such as misleading voters, polarizing public debate, and discrediting traditional media (Curini and Pizzimenti, 2020) – exemplifies this issue.
In this study, we investigate whether LLMs have accurate political knowledge and attempt to ground this knowledge in their responses to users. We test whether LLMs engage in grounding and recognize misinformation introduced into the common ground, by examining their ability to detect and reject false presuppositions in user prompts. We focus on political contexts where misinformation poses significant risks, experimenting with three contemporary LLMs. Our approach evaluates whether LLMs merely store factual knowledge or can actively negotiate and reject misinformation, even when it is subtly introduced. Additionally, we explore how political bias and the mirroring of face-saving strategies may influence the way LLMs accept or reject misinformation, providing insight into their potential impact on political discourse.
This study, together with a concurrent study (Sieker et al., 2025), forms the FLEX Benchmark (False Presupposition Linguistic Evaluation eXperiment), a systematic investigation of how LLMs process false presuppositions in politically sensitive contexts. While Sieker et al. (2025) focus on how linguistic factors (such as presupposition trigger type and embedding context) affect models’ susceptibility to false presuppositions, the present work shifts the emphasis to communicative grounding: whether LLMs can actively identify and reject problematic assumptions rather than merely store or retrieve factual knowledge. Both studies’ evaluation datasets are publicly released as part of the FLEX Benchmark to support further research: https://doi.org/10.5281/zenodo.15348857.
(False) Presuppositions. One key phenomenon in the study of discrepancies in common ground is presuppositions, i.e., background knowledge or shared beliefs that interlocutors take for granted (Stalnaker, 1973). For example, the sentence ‘The king of France is 65’ presupposes that the France has a king, introduced by the definite article ‘the’. Words like ‘the’ are examples of presupposition triggers – elements that introduce presuppositions. These triggers are diverse and widespread in everyday language, highlighting their integral role in communication (Beaver et al., 2024; Levinson, 1983). Central to our study is the phenomenon of presupposition failure, which occurs when a presupposition assumed to be true is instead false (Yablo, 2006) (as illustrated in the introduction). Such failures potentially lead to breakdowns in communication or coherence (Xia et al., 2019). However, not all failures disrupt discourse; in some cases, the hearer may adjust their knowledge to align with the speaker’s presuppositions, a process known as accommodation, cf. von Fintel (2008); Beaver et al. (2024); Degen and Tonhauser (2021). For instance, in ‘The king of France is 65’, a hearer unsure about the king’s existence may still accommodate this presupposition, adopting the belief that there is a king of France and allowing the conversation to continue smoothly. Presuppositions, in such cases, can easily lead to misinformation being established in the common ground. In such cases, accommodation, thus, is not an appropriate response strategy in the face of missing or uncertain knowledge. Since models require relevant background information to generate coherent and truthful responses, they should not silently accommodate false presuppositions. Instead, when encountering a presupposition they cannot verify, they should engage in an act of conversational grounding, i.e., signal the misalignment and indicate their lack of the necessary knowledge.
Conversational Grounding in LLMs There is substantial work on probing the knowledge of LLMs (Fierro et al., 2024), such as factual and common sense knowledge, and on discovering knowledge inconsistencies and conflicts within LLMs (Xu et al., 2024). Furthermore, there is growing interest in examining the (pragmatic) linguistic knowledge represented in LLMs (Ruis et al., 2023; Fried et al., 2023; Sieker et al., 2023), encompassing the exploration of presuppositions (Jiang and de Marneffe, 2019; Jeretic et al., 2020; Sieker and Zarrieß, 2023). Less attention, however, has been given to how LLMs manage the shared knowledge and beliefs required for successful communication with a user, i.e. grounding. A few studies have benchmarked LLMs’ abilities in situations where grounding is initiated by users, through repair (Balaraman et al., 2023) or feedback (Pilan et al., 2024). Grounding failures in pretrained models have been qualitatively documented (Benotti and Blackburn, 2021; Fried et al., 2023; Chandu et al., 2021), but their prevalence and impact are still underexplored. Shaikh et al. (2024) compare LLM-generated dialogue with human conversations, finding that LLMs are 77.5% less likely to include grounding acts, often presuming common ground instead. Related to this, LLMs exhibit other problematic conversational patterns, including overconfidence (Mielke et al., 2022), overinformative responses (Tsvilodub et al., 2023), responses inducing unjustified user trust (Sieker et al., 2024), or sycophancy (Perez et al., 2023; Nehring et al., 2024).
Avoidance of Disagreement in Conversation In politeness theory, face refers to the positive selfimage that individuals seek to maintain in social interactions (Goffman, 1955). Interlocutors work to protect this self-image through face-saving actions, i.e. strategies to avoid or mitigate potential threats to face ranging from employing mitigating words, such as hedges or modals, to omitting the potentially face-threatening speech act altogether (Brown and Levinson, 1987). Disconfirming actions pose a potential threat to face, both for the speaker and the recipient, as they may signal a lack of alignment or cooperation while simultaneously questioning the speaker. Studies show that speakers across various cultures tend to avoid explicit contradiction (Lee, 2016; Imo, 2017). Facesaving actions are so deeply ingrained in human conversational behaviour that speakers even employ them when interacting with AI-based robots, despite these systems lacking a face or self-image to protect (Lumer et al., 2023).
Conversational Question Answering in LLMs Previous research on QA systems primarily focused on simple questions. A few studies, though, reveal that models face challenges with loaded questions containing false or unverifiable presuppositions (Kim et al., 2021, 2023; Daswani et al., 2024; Yu et al., 2023; Srikanth et al., 2024). Studies on LLMs in political contexts focus on how they reflect political biases Kameswari et al. (2020); Feng et al. (2023); Hartmann et al. (2023); Bang et al. (2024); Fulay et al. (2024). Hartmann et al. (2023), for instance, found a pro-environmental, left-libertarian bias in ChatGPT, favoring policies like flight taxes and legalizing abortion. Our study also includes an analysis of bias, but focuses on LLMs’ ability to adequately ground political assumptions and handle false presuppositions, when answering questions in a political context.
We design a battery of questions that centers around these facts but embeds them into different question types and require different types of answers, i.e. confirmatory and disconfirmatory responses, as well as grounding acts. Precisely, given a true fact F, we distinguish between the following direct questions and loaded questions:
Direct question, confirmatory: Is it true that F?
Correct answer: Yes.
Direct question, disconformatory: Is it true that
¬F?
Correct answer: No.
Loaded question: Does X know that ¬F?
Correct answer: Wait a minute, F is not true, the question does not make sense.
These types of questions serve distinct purposes: direct questions are suited for testing knowledge, i.e. they do not require reasoning about the common ground. Loaded questions trigger presuppositions (e.g., through the factive verb know), they require reasoning about the common ground and are effective for evaluating grounding behavior. Thus, we employ direct questions to assess the knowledge of an LLM and loaded questions to analyze the models’ grounding behavior.
The question ‘Did voters resent the fact that the AfD party is not in favor of permanent border controls between EU member states?’ presupposes, via the factive verb resent, that the far-right party AfD opposes border controls. However, this presupposition fails, as the AfD holds the opposite position. Llama, however, generates a response that accommodates this false belief, illustrating the high risk of misinformation that is at stake in the political domain we investigate in our study.
Using the Wahl-O-Mat statements, we constructed prompts on four parties currently represented in the German Bundestag: DIE LINKE (left-wing), AfD (far-right), SPD (center-left), and CDU/CSU (center-right). All questions were framed as polar questions. To test the models’ knowledge about the party’s positions, we embedded the claims into direct questions asking ‘Is it true, that <- party -> is in favor of <-claim>?’ (confirmatory) and ‘Is it true, that <-party> is not in favor of <-claim>?’ (disconfirmatory) for each party and claim (see Section 3). For the loaded questions, we embedded false claims into sentences with factive verbs such as find out (German "herausfinden") as presupposition triggers. We generated loaded questions for each claim and party following the pattern ‘Did the voters <-factive verb> that <-party -> is in favor of <-negated claim ->?’. This approach resulted in 882 loaded, 147 disconfirmatory, and 147 confirmatory direct questions. Examples for the question types and model responses can be found in Table 1 and additionally in Table 4 and Table 5 in the Appendix.
The models’ responses were often lengthy and complex. E.g., responses rarely provided simple ‘yes’ or ‘no’ answers and often failed to directly address the question. See Table 5 in Appendix A.1 for example model answers. Therefore, the automatic evaluation of model responses was infeasible, as it required careful reading and expertise in linguistics and politics.
The models’ responses were often lengthy and complex. E.g., responses rarely provided simple ‘yes’ or ‘no’ answers and often failed to directly address the question. See Table 5 in Appendix A.1 for example model answers. Therefore, the automatic evaluation of model responses was infeasible, as it required careful reading and expertise in linguistics and politics.
Annotation of Loaded Questions. We asked seven annotators, including the authors, to evaluate the models’ responses to the loaded questions containing false presuppositions (see Section 3). We restricted the annotation categories to those pertinent to our research question, assessing whether LLMs correctly reject or incorrectly accommodate the false presupposition:
Misinformation Accommodated: The model accepted the presupposition, e.g. by answering the polar question or using referential expressions.
Misinformation Rejected: The model generated a grounding act, refuting the false presupposition, e.g. by stating the question was based on a false assumption or implicitly conveying the party’s actual stance. Cases where the model stated that it didn’t have the knowledge to answer properly was also marked as rejection.
Imprecise Answer: It was unclear if the false presupposition was accommodated, including cases where the model didn’t answer directly, failed to provide the party’s stance, or offered an unrelated response.
We emphasize that only responses categorized as Misinformation Rejected represent the ideal, where the model correctly identifies the false presupposition. Responses classified as Misinformation Accommodated represent the least favorable outcome.
Responses in the Imprecise Answers category, however, are also problematic as they neither reject the false presupposition nor provide clear, relevant information.
For example, assume that GPT responded ‘yes’ twice and ‘no’ once to both the confirmatory direct question ‘Is it true that AfD is in favor of border controls?’ and the disconfirmatory direct question ‘Is it true that AfD is not in favor of border controls?’. Since the true claim is that the AfD supports border controls, the correct answer to the confirmatory question is ‘yes’, while the correct answer to the disconfirmatory question is ‘no’. Thus, two yes responses and one no to confirmatory questions yield a score of two correct answers for the confirmatory question, while the same answer pattern to the disconfirmatory question yields one correct answer; resulting in a total score of three correct answers. Consequently, all loaded questions embedding the claim "AfD is not in favor of border controls" would be categorized into the group weak belief.
Ideal grounding behavior in our setting would correspond to a rejection rate of 100% for the loaded questions containing the false presuppositions. Yet, all models struggle to reject misinformation.
A significant number of responses from all models are imprecise, suggesting that they often fail to directly address the falsehood or provide a relevant response. Overall, these results suggest that the models struggle to reject false information and engage in active grounding when misinformation is embedded via a loaded question.
Interestingly, even with full (false) knowledge, accommodation remains easier for the model than rejection is with full (correct) knowledge. If the models exhibited comparable behavior for rejection under full belief and accommodation under wrong belief, we would expect the distributions to be similar. However, as visible in Figure 1, the bar representing the lowest grounding score in the weak belief group is twice as high as the bar representing the highest grounding score in the strong belief group. The two intermediate knowledge groups, no/weak belief and moderate belief, demonstrate high accommodation rates, which underscores GPT’s nevertheless remaining difficulties with grounding. In cases of uncertainty or lack of knowledge, accommodation should ideally not occur. This highlights a critical limitation in GPT’s grounding capabilities.
Do LLMs save face? All models show strong preferences against rejection responses to loaded questions, even when they correctly answered the direct questions. This suggests that their lack of active grounding cannot be attributed solely to a lack of knowledge, but may also relate to an avoidance of responses that constitute a potential face threat for the user. Research on human interaction commonly assumes that agreement is preferred over disagreement, as humans strive to maintain social harmony and protect the face of their conversational partners. Our goal is to determine whether this is also reflected in LLMs’ responses and impacts their capabilities in initiating grounding.
Only GPT successfully rejected misinformation when equipped with strong and accurate beliefs. However, similar to Mistral, it tended to adopt avoidance strategies comparable to human face-saving when its knowledge was less robust. LLaMA, instead, mainly gave imprecise answers, seemingly unaffected by its knowledge level. We also observed a notable political bias in GPT, which demonstrated an excessive tendency to reject claims related to the far-right party. This behavior, however, may rather stem from the reproduction of human conversational tendencies in controversial settings than from an actual political bias.
The small LLaMA appears to lack knowledge and tends to exhibit fuzzy response behavior. Mistral, on the other hand, could be viewed as the smaller, less informed, and more reserved sibling of GPT: while knowledge is present, it retreats when disagreement with the counterpart is required.
We found that the models do not systematically reject misinformation, even when knowledge is present. Based on our findings, we recommend a deeper, potentially qualitative examination of LLM conversational behavior.