Can Large Language Models Understand Context?

Paper · arXiv 2402.00858 · Published February 1, 2024

Understanding context is key to understanding human language, an ability that Large Language Models (LLMs) have increasingly been shown to demonstrate to an impressive extent. However, although LLM evaluation encompasses various domains within Natural Language Processing, limited attention has been paid to probing their linguistic capability to understand contextual features. This paper introduces a context understanding benchmark by adapting existing datasets to suit the evaluation of generative models.

Discourse understanding, as one of the fundamental problems in NLP, focuses on modeling linguistic features and structures that go beyond individual sentences (Joty et al., 2019). Understanding discourse requires resolving the relations between words/phrases (coreference resolution) and discourse units (discourse parsing and discourse relation classification) in the previous context, identifying carry-over information for the following context (dialogue state tracking), and recognizing discourse-specific phenomena (ellipsis).

The coreference resolution (CR) task contributes to a coherent understanding of the overall meaning conveyed within a text. It therefore plays a critical role in probing language models' capability to grasp coreference relations as well as contextual nuances within documents. We select two coreference datasets: WSC273 (Levesque et al., 2012) and OntoNotes 5.0 (Pradhan et al., 2013). WSC273, which contains the first 273 examples from the Winograd Schema Challenge, requires the system to read a sentence with an ambiguous pronoun and select the referent of that pronoun from two choices. OntoNotes is a human-annotated corpus of documents annotated with multiple layers of linguistic information, including syntax, propositions, named entities, word sense, and in-document coreference.

As OntoNotes is one of the most frequently used datasets for training coreference models, prior research has achieved significant advances under the supervised fine-tuning paradigm (Lee et al., 2017; Joshi et al., 2020; Bohnet et al., 2023). However, these model designs do not extend to generative models under in-context learning (ICL) settings. Recently, Le and Ritter (2023) leveraged document templates for LLMs; however, their evaluation is confined to prominent models such as InstructGPT (Ouyang et al., 2022), neglecting the fact that smaller models lack the generative capacity required to accomplish such tasks. Given these limitations, we propose a novel multiple-choice task design in which we provide the mentions and evaluate the model on resolution. Each option represents a potentially markable span. Table 1 presents an example of the input to the model. The entire prompt consists of five parts: (1) an instruction that guides the model through the task, (2) a document containing plain text with a selected mention span highlighted in bold, (3) a list of choices comprising all the gold mentions present in the document, (4) a question that directs the model's attention, and (5) the guiding word "Answer" that prompts for the output. We experiment with multiple instructions and prompts and report the one with the best performance.

Linking scores are computed for each question, and the results are aggregated for evaluation. We use the official evaluation metrics from the CoNLL-2012 shared task (Pradhan et al., 2012): the CoNLL F1 score, derived by averaging three coreference metrics, MUC, B³, and CEAF_φ4.
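
To make the five-part prompt layout concrete, the following minimal Python sketch assembles an input in this style. The instruction wording, the bold-marker convention, and the names `build_coref_prompt` and `gold_mentions` are illustrative assumptions, not the paper's exact template from Table 1.

```python
def build_coref_prompt(document: str, mention: str, gold_mentions: list[str]) -> str:
    """Assemble the five prompt parts: instruction, document with the
    selected mention highlighted, candidate choices, question, and the
    guiding word 'Answer:'."""
    # (1) Task instruction (hypothetical wording).
    instruction = (
        "Read the following document and answer the question about "
        "coreference between the marked mention and the choices."
    )
    # (2) Document with the selected mention span highlighted in bold.
    marked_doc = document.replace(mention, f"**{mention}**", 1)
    # (3) Choices: every gold mention span in the document is a candidate.
    choices = "\n".join(
        f"({chr(ord('A') + i)}) {m}" for i, m in enumerate(gold_mentions)
    )
    # (4) A question directing the model's attention, and (5) the guiding word.
    question = f"Which choice refers to the same entity as **{mention}**?"
    return f"{instruction}\n\n{marked_doc}\n\nChoices:\n{choices}\n\n{question}\nAnswer:"


# Usage with a Winograd-style example:
print(build_coref_prompt(
    "The city councilmen refused the demonstrators a permit because "
    "they feared violence.",
    "they",
    ["The city councilmen", "the demonstrators", "a permit"],
))
```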
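For reference, the final aggregation step is a simple unweighted mean of the three metrics' F1 scores; a one-line sketch (argument names are ours):

```python
def conll_f1(muc: float, b_cubed: float, ceaf_phi4: float) -> float:
    """CoNLL F1: the unweighted average of the MUC, B³, and CEAF_φ4 F1 scores."""
    return (muc + b_cubed + ceaf_phi4) / 3.0
```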