LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models
We conduct detailed analysis with a range of LLMs such as GPT-4, ChatGPT, Gemini, Llama-2, and Mistral using chain-of-thought prompting. Experimental results show that existing LLMs do not fare well on LogicBench; especially, they struggle with instances involving complex reasoning and negations. Furthermore, they sometimes overlook contextual information necessary for reasoning to arrive at the correct conclusion.
Propositional Logic (PL) Propositional logic employs a collection of statements or propositions (denoted as P = p1, p2, ..., pn, where pi represents a proposition) and builds upon them using logical connectives such as ‘∧’, ‘∨’, ‘→’, ‘↔’, and ‘¬’. Several inference rules for propositional logic have been defined using which given a set of premises, one can derive a sound conclusion. For example, let’s consider two propositions: p1, which states "It is raining," and p2, which states "It is cloudy." From these propositions, we can construct a context/knowledge base (KB) with two premises: (1) p1 → p2 and (2) p1. With this KB, we can conclude p2. This inference rule is written as ((p1 → p2) ∧ p1) ⊢ p2 and is known as ‘Modus Ponens’. In our study, we explore nine distinct inference rules of propositional logic, extensions of seven of them with one-variable and a universal quantifier, and two axioms of first-order logic as shown in Table 2. These inference rules provide a proper framework for deriving valid conclusions.
First-order Logic (FOL) In this work, we consider a restricted set of logical axioms for FOL that utilize quantifiers, ∀ (universal quantifier) and ∃ (existential quantifier). The universal quantifier (∀) denotes that a statement holds true for all instances within a specific category. In contrast, the existential quantifier (∃) indicates that a statement is true for at least one instance within its scope. For instance, a simple extension of propositional ‘Modus Ponens’ is an inference rule where given the premises ∀(p(x) → q(x)) and p(a), we conclude q(a) (e.g., given “All kings are greedy” and “Sam is a king”, we can conclude “Sam is greedy”). Here, we explore two axioms (EG and UI - in detail in App. C.3) and various inference rules that incorporate the quantifiers (shown in Table 2).
Non-monotonic (NM) Reasoning In this work, we analyze a range of logical reasoning templates in NM logics involving “Default Reasoning,” “Reasoning about Unknown Expectations,” and “Reasoning about Priorities.” These templates are inspired by the compilation (Lifschitz, 1989) made in 1989 to evaluate the abilities of various non-monotonic logics that were being developed at that time. Below Table 3 shows examples of NM reasoning. Additional examples are given in App. C.4.
A key aspect of NM logics is to formalize notions such as “normally,” “typically,” and “usually” that are not directly formalizable using classical quantifiers in the first-order setting. The general rule “Heavy blocks are normally located on the table” does not imply that “All heavy blocks are always located on the table”. Rather, this rule allows for exceptions. Our work explores various NM reasoning patterns, as depicted in Figure 1, to delve deeper into the nuances of this type of reasoning.