Minds versus Machines: Rethinking Entailment Verification with Language Models
Leveraging a comprehensively curated entailment verification benchmark, we evaluate both human and LLM performance across various reasoning categories. Our benchmark comprises datasets from three categories (NLI, contextual QA, and rationales) that include multi-sentence premises and different knowledge types, thereby evaluating inference capabilities on complex reasoning instances. Notably, our findings reveal LLMs’ superiority in multi-hop reasoning across extended contexts, while humans excel in tasks requiring simple deductive reasoning. Building on these insights, we introduce a fine-tuned Flan-T5 model that outperforms GPT-3.5 and rivals GPT-4, offering a robust open-source solution for entailment verification. As a practical application, we showcase the efficacy of our fine-tuned model in enhancing self-consistency in model-generated explanations, resulting in an average 6% performance boost across three multiple-choice question-answering datasets.
• Type of Premise: NLI datasets such as SNLI and MNLI typically contain single-sentence premises, potentially encouraging shortcut learning. In contrast, we focus on multi-sentence premises that require complex reasoning. We also consider datasets where the premise is a rationale, i.e., not merely a logical precursor to the hypothesis but an explanation of it. This tests the ability to evaluate model-generated rationales (Wei et al., 2022).
• Type of Knowledge: Often, one or more pieces of information in the premise must be used to predict support. We categorize this information as entity-grounded, commonsense, or localized. Entity-grounded knowledge consists of information about entities and other general knowledge verifiable on the internet, such as facts about general science or history, or details of a known person or event. This information can be inferred even when it is not mentioned in the premise. Commonsense knowledge is information about everyday life that humans use implicitly but that cannot always be verified online; it is often missing from the premise and must be inferred implicitly. Lastly, localized knowledge is all other information, provided to understand the events, people, or items mentioned in the premise, that is not grounded in any known entity. Because it depends on the premise’s specific context, it is impossible to infer unless stated explicitly. Please refer to Table 2 for examples of each knowledge type.
Contextual QA Next, we consider QA datasets where the task is to answer a question based on a given context and a set of options. We use an off-the-shelf QA-to-statement converter model (Chen et al., 2021) to generate a hypothesis statement for each question-option pair. The hypothesis corresponding to the correct choice is then marked as “support”, while the rest are marked as “not support”, yielding the entailment verification dataset.
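To make this construction concrete, the following is a minimal sketch of the conversion step, under stated assumptions: the convert_to_statement helper is a hypothetical stand-in for the off-the-shelf QA-to-statement converter (Chen et al., 2021), which in practice produces a fluent declarative hypothesis rather than the naive concatenation used here.

```python
# Sketch: build entailment verification examples from one multiple-choice
# QA instance. convert_to_statement is a hypothetical stand-in for the
# QA-to-statement converter model (Chen et al., 2021).

def convert_to_statement(question: str, option: str) -> str:
    # Naive stand-in: a trained converter model would rewrite the
    # question-option pair into a fluent declarative sentence.
    return f"{question.rstrip('?')} {option}."

def qa_to_entailment(context: str, question: str,
                     options: list[str], correct_idx: int) -> list[dict]:
    """Use the QA context as the premise; each option yields one hypothesis.
    Only the correct option's hypothesis is labeled "support"."""
    return [
        {
            "premise": context,
            "hypothesis": convert_to_statement(question, option),
            "label": "support" if i == correct_idx else "not support",
        }
        for i, option in enumerate(options)
    ]
```

In this sketch, each QA instance with k options produces one “support” example and k−1 “not support” examples, all sharing the same premise.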