Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs
Knowledge graphs (KGs) often contain sufficient information to support the inference of new facts. Identifying logical rules not only improves the completeness of a knowledge graph but also enables the detection of potential errors, reveals subtle data patterns, and enhances the overall capacity for reasoning and interpretation. However, the complexity of such rules, combined with the unique labeling conventions of each KG, can make them difficult for humans to understand. In this paper, we explore the potential of large language models to generate natural language explanations for logical rules.
We examine various prompting strategies, including zero- and few-shot prompting, the inclusion of variable entity types, and chain-of-thought reasoning. We conduct a comprehensive human evaluation of the generated explanations based on correctness, clarity, and hallucination, and also assess the use of large language models as automatic judges.
Knowledge graphs (KGs) encode factual information as triples of the form (subject s, predicate p, object o). They are integral to a wide range of artificial intelligence tasks and applications [9,11]. Although large-scale KGs (e.g., Freebase [3] and Wikidata [19]) contain a vast number of triples, they are often incomplete, which adversely affects their usefulness in downstream applications. Nevertheless, KGs often hold sufficient information to infer new facts [7,17]. For example, if a KG indicates that a certain woman is the mother of a child, it is quite likely that her husband is the child's father. Identifying such rules can help infer highly probable missing facts, which can then be verified by human data workers or experts. In addition to enhancing the completeness of KGs, such rules can also aid in detecting potential errors, deepening our understanding of the data's inherent patterns, and facilitating reasoning and interpretability [13,7]. Rule learning systems, such as AMIE [8,2] and AnyBURL [12], derive Horn rules for symbolic reasoning and link prediction in KGs. These rules can serve as explanations for specific predictions; for instance, they can assist domain scientists in uncovering missing relationships within their data. However, rules are often challenging for humans to comprehend, especially for non-experts. The difficulty arises from the abstract logical structure and complexity of the rules, i.e., the number of logical components (referred to as atoms), as well as from the nuanced nature of entity and relation labels within each KG. For instance, as explained in [18], labels of predicates in the Freebase dataset follow the format /[domain]/[type]/[label] (e.g., /american_football/player_rushing_statistics/team). Without proper background knowledge about such differences in KG labels, evaluating logical rules can become cumbersome.
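For concreteness, the mother-father intuition above can be written as a Horn rule of the kind such systems mine, here with illustrative relation names rather than actual Freebase labels:

?m motherOf ?c ∧ ?m marriedTo ?f => ?f fatherOf ?c

The atoms to the left of => form the rule's body, and the atom to the right is its head; whenever both body atoms match triples in the KG, the rule suggests the head fact, which a natural language explanation might render as "if a woman is the mother of a child and is married to a man, that man is likely the child's father."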
One way to address this challenge is by providing natural language explanations for logical rules, which enhance accessibility and usability, aid KG management in cross-disciplinary contexts, and improve transparency for researchers and practitioners. Pre-defined templates can generate such explanations, but this approach is not scalable, as it is impractical to manually extract all logical rules from a large KG and define a template for each. To handle unseen rules, solutions leveraging large language models (LLMs) are promising due to their generative abilities and generalization capability. Related work has focused on natural language generation from logical forms [21,5], natural language generation from KGs [16], encoding and translating natural rules [6,1], and rule-based reasoning with LLMs [15,22].
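To illustrate why templates do not scale, consider a minimal, hypothetical verbalizer sketch (the relation names reuse the illustrative rule above): every distinct rule pattern needs its own hand-written template, and any unseen rule simply fails.

```python
# Minimal sketch of template-based rule verbalization (hypothetical code,
# not an actual system). Each rule pattern requires a hand-written template,
# so coverage cannot keep up with the rules mined from a large KG.
TEMPLATES = {
    ("motherOf", "marriedTo", "fatherOf"):
        "If {m} is the mother of {c} and {m} is married to {f}, "
        "then {f} is likely the father of {c}.",
}

def verbalize(body_relations, head_relation, **bindings):
    key = tuple(body_relations) + (head_relation,)
    if key not in TEMPLATES:
        raise KeyError(f"no template for rule pattern {key}")  # unseen rule
    return TEMPLATES[key].format(**bindings)

print(verbalize(["motherOf", "marriedTo"], "fatherOf",
                m="Mary", c="Tom", f="John"))
```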
To the best of our knowledge, this is the first work to examine the effectiveness of LLMs in generating natural language explanations for logical rules. We mined rules with AMIE 3.5.1, the latest version of the algorithm released in 2024, using the widely used cross-domain benchmark dataset FB15k-237 [4] and two properly preprocessed large-scale variants of the Freebase dataset, FB-CVT-REV and FB+CVT-REV [18] (Section 2). We investigated a range of prompting strategies, such as zero- and few-shot prompting [10], incorporating an instance of the rule, including variable entity types, and Chain-of-Thought (CoT) reasoning [20] (Section 3). To evaluate the quality of the generated explanations, we conducted detailed human evaluations based on criteria such as correctness, clarity, and hallucination. Additionally, we explored the potential of LLM-as-a-judge [23] for this task (Section 4). Our findings indicate that combining CoT prompting with variable type information yields the most accurate and readable explanations, and overall they highlight a promising direction for this task. We conclude the work and outline potential avenues for future research in Section 5. All the scripts and data produced from this work are available from our GitHub repository at https://github.com/idirlab/KGRule2NL.
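As an illustration of the zero-shot setting, the following is a minimal sketch; the prompt wording and model name are assumptions for exposition rather than the exact configuration (the released scripts in the repository above contain the materials actually used).

```python
# Minimal sketch of zero-shot prompting for rule explanation
# (hypothetical prompt wording and model choice).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

rule = ("?b /time/event/instance_of_recurring_event World Series "
        "=> World Series /sports/sports_championship/events ?b")

prompt = (
    "You are given a logical rule mined from a knowledge graph.\n"
    "Atoms left of => form the body; the atom right of => is the head.\n"
    "Explain in one or two plain-English sentences what this rule means.\n\n"
    f"Rule: {rule}\n"
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```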
Phase 2: Utilizing Variable Entity Types in the Prompt
This phase initially incorporated rule instantiations into the prompt design. However, analysis of the generated explanations revealed persistent limitations in the model's ability to identify variable entity types, leading us to integrate these types into the prompt directly. For instance, in the rule ?b /time/event/instance_of_recurring_event World Series => World Series /sports/sports_championship/events ?b, World Series is a constant entity and ?b is a variable entity. In Freebase datasets, entities can belong to multiple types; consequently, each variable entity is associated with a list of types. Given an edge type and its edge instances, the mapping from the edge type to a type that all subjects of those instances belong to is almost a function, and similarly for objects [18]. For the example above, the variable ?b's types are either /time/event or /sports/sports_championship_event. For this phase, three annotators annotated the 100 rules with the highest head coverage: the top 50 from FB-CVT-REV and the top 50 from FB+CVT-REV. Unlike the previous phase, the annotators were asked to complete metric evaluations for explanations from both prompts: the zero-shot prompt as our baseline and the prompt incorporating variable types. As discussed in Section 4, our findings show that providing variable type information significantly improved the model's performance in generating accurate explanations.
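To make the prompt construction concrete, below is a minimal sketch of how variable type information can be appended to the baseline prompt; the function name and prompt wording are illustrative assumptions, not the exact prompts used in our experiments.

```python
# Minimal sketch of a Phase 2 prompt builder (illustrative wording;
# not the exact prompt used in the experiments).
def build_typed_prompt(rule: str, var_types: dict) -> str:
    """Append each variable's possible entity types to the baseline prompt."""
    type_lines = "\n".join(
        f"  {var}: {', '.join(types)}" for var, types in var_types.items()
    )
    return (
        "Explain the following logical rule from Freebase in plain English.\n"
        f"Rule: {rule}\n"
        f"Variable entity types:\n{type_lines}\n"
    )

prompt = build_typed_prompt(
    "?b /time/event/instance_of_recurring_event World Series "
    "=> World Series /sports/sports_championship/events ?b",
    {"?b": ["/time/event", "/sports/sports_championship_event"]},
)
print(prompt)
```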