Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
Large language models (LLMs) have demonstrated remarkable proficiency in generating detailed and coherent explanations of complex concepts. However, the extent to which these models truly comprehend the concepts they articulate remains unclear. To assess a model's comprehension of the content it generates, we implemented a self-evaluation pipeline in which a model: (i) given a topic, generates an excerpt with information about the topic; (ii) given an excerpt, generates question-answer pairs; and (iii) given a question alone, generates an answer. We refer to this self-evaluation approach as Explain-Query-Test (EQT). Interestingly, accuracy on the questions produced by the EQT pipeline correlates strongly with model performance on standard benchmarks such as MMLU-PRO. In other words, EQT's performance is predictive of MMLU-PRO's, and EQT can be used to rank models without any external evaluation data beyond a list of topics of interest. Moreover, our results reveal a disparity between the models' ability to produce detailed explanations and their performance on questions related to those explanations. This gap highlights fundamental limitations in the internal knowledge representation and reasoning abilities of current LLMs.
While LLMs demonstrate remarkable prowess in generating detailed explanations of concepts, an important question arises: Does this ability reflect true comprehension, or is it simply a sophisticated form of pattern recognition? More specifically, when an LLM explains a concept, can it answer related questions derived from that explanation without direct access to the explanation during testing?
Self-evaluation is crucial for understanding whether LLMs possess genuine reasoning abilities or merely exploit correlations in training data. By focusing on the relationship between explanation generation and subsequent question answering, we aim to probe the depth of their internal knowledge and the robustness of their reasoning capabilities.
Understanding this potential disconnect matters for several reasons. First, the ability to explain concepts and correctly answer related questions is fundamental for applications in education, healthcare, and decision-making systems (Bommasani et al., 2021). For instance, an LLM used in education should not only provide clear explanations to students but also demonstrate understanding by accurately answering follow-up questions. Second, if models fail at this task, it highlights limitations in their internal knowledge representation and reasoning, signaling risks for high-stakes applications where reliability and understanding are paramount. Finally, this evaluation aligns with broader efforts to ensure that AI systems exhibit true understanding rather than merely leveraging statistical correlations in data (Bender et al., 2021).
In this study, we propose a novel self-evaluation framework, Explain-Query-Test (EQT), to assess to what extent state-of-the-art LLMs can independently answer questions derived from their own explanations, without access to those explanations during testing. EQT is performed in three steps: (i) given a topic, a model generates an excerpt with information about the topic; (ii) given the excerpt, the same model generates question-answer pairs; and (iii) given a question alone, without the excerpt, the model generates an answer. By decoupling explanation from question answering, EQT tests the models' internal knowledge, reasoning, and consistency, requiring them to rely on deeper comprehension rather than surface-level text patterns. This allows us to rigorously measure not just whether LLMs can generate plausible explanations, but whether they can independently apply the same underlying knowledge to novel yet related tasks.
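To make the pipeline concrete, the sketch below outlines one way the three EQT steps could be wired together in Python. The `ask_model` helper, the prompt wording, the JSON question format, and the substring-based grading are illustrative assumptions for exposition, not the exact implementation used in our experiments.

```python
# Minimal sketch of the EQT pipeline: explain a topic, derive QA pairs from the
# explanation, then answer the questions without access to the explanation.
import json


def ask_model(prompt: str) -> str:
    """Send a single prompt to the model under test and return its reply.

    Placeholder: wire this to whichever LLM client is being evaluated.
    """
    raise NotImplementedError


def explain(topic: str) -> str:
    # Step (i): the model writes an excerpt about the topic.
    return ask_model(f"Write a detailed, self-contained excerpt explaining: {topic}")


def query(excerpt: str, n_questions: int = 5) -> list[dict]:
    # Step (ii): the same model turns its own excerpt into question-answer pairs.
    reply = ask_model(
        f"From the excerpt below, write {n_questions} question-answer pairs as a "
        f'JSON list of objects with keys "question" and "answer".\n\n{excerpt}'
    )
    return json.loads(reply)


def test(qa_pairs: list[dict]) -> float:
    # Step (iii): the model answers each question *without* seeing the excerpt.
    correct = 0
    for pair in qa_pairs:
        answer = ask_model(f"Answer concisely: {pair['question']}")
        # Naive substring grading; a judge model or human grading would be
        # more faithful in practice.
        if pair["answer"].strip().lower() in answer.strip().lower():
            correct += 1
    return correct / len(qa_pairs)


def eqt_accuracy(topic: str) -> float:
    excerpt = explain(topic)
    qa_pairs = query(excerpt)
    return test(qa_pairs)
```

In this framing, the only external input is the topic itself; the excerpt, the questions, and the answers are all produced by the model under evaluation.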
The size of this gap varies by category. For instance, categories such as biology and psychology, where models initially perform well, show significant degradation in accuracy, whereas categories such as law and engineering, where models already exhibit lower baseline performance, experience smaller relative drops. This trend suggests that the drop in performance is driven by the disparity between surface-level accuracy and the deeper understanding required to answer questions derived from explanations: models struggle to translate high accuracy on the original dataset into consistent performance on tasks that demand conceptual reasoning.
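For clarity, the relative drop referred to here is simply the fraction of baseline accuracy lost when a model answers EQT-derived questions; a hypothetical per-category helper, illustrative only, could be defined as:

```python
def relative_drop(baseline_acc: float, eqt_acc: float) -> float:
    """Fraction of baseline accuracy lost on EQT-derived questions.

    Illustrative definition only; the exact metric may be computed differently.
    """
    return (baseline_acc - eqt_acc) / baseline_acc
```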
Using the EQT approach we introduced, we evaluated models by prompting them to generate detailed explanations and then testing their ability to answer derived questions independently. The results revealed a significant gap between the models’ ability to generate coherent explanations and their performance on questions derived from those explanations. This discrepancy highlights fundamental limitations in the internal knowledge representation and reasoning capabilities of current LLMs.