Exploring the Potential of ChatGPT on Sentence Level Relations: A Focus on Temporal, Causal, and Discourse Relations
This paper quantitatively evaluates the performance of ChatGPT, an interactive large language model, on inter-sentential relations such as temporal relations, causal relations, and discourse relations. Given ChatGPT’s promising performance across various tasks, we carry out thorough evaluations on the whole test sets of 11 datasets, covering temporal relations, causal relations, and both PDTB2.0-based and dialogue-based discourse relations. To ensure the reliability of our findings, we employ three tailored prompt templates for each task: a zero-shot prompt template, a zero-shot prompt engineering (PE) template, and an in-context learning (ICL) prompt template. This establishes, for the first time, baseline scores for all popular sentence-pair relation classification tasks. Through our study, we find that ChatGPT exhibits exceptional proficiency in detecting and reasoning about causal relations, although it does not possess the same level of expertise in identifying the temporal order between two events. While it can identify the majority of discourse relations when explicit discourse connectives are present, implicit discourse relation recognition remains a formidable challenge. ChatGPT also demonstrates subpar performance on dialogue discourse parsing, a task that requires understanding the structure of a dialogue before its discourse relations can be recognized.
To comprehend the natural language text at a deeper level, it is crucial for an LLM to capture and understand the higher-level inter-sentential relations from the text, which involves mastering more complex and abstract relations beyond surface forms. These inter-sentential relations, such as temporal, causal, and discourse relations between two sentences, are widely used to form knowledge that has been proven to benefit many downstream tasks (Dai and Huang, 2019; Tang et al., 2021; Ravi et al., 2023; Su et al., 2023).
The primary insights drawn from the analysis of quantitative assessments are as follows:
• Temporal relations: ChatGPT has difficulty in identifying the temporal order between two events, which could be attributed to inadequate human feedback on this feature during the model’s training process.
• Causal relations: ChatGPT exhibits strong performance in detecting and reasoning about causal relationships, particularly on the COPA dataset. It also outperforms fine-tuned RoBERTa on two out of three benchmarks.
• Discourse relations: Explicit discourse relations can be easily recognized by ChatGPT thanks to the explicit discourse connectives in context. However, it struggles with the absence of connectives for implicit discourse tasks, particularly with the link and relation prediction in dialogue discourse parsing.
Discourse Relation. Discourse relation recognition is a vital task in discourse parsing, identifying the relation between two arguments (i.e., sentences or clauses) in the discourse structure. It is essential for textual coherence and is regarded as a critical step in constructing knowledge graphs (Zhang et al., 2020, 2022a) and in various downstream tasks involving longer contexts, such as text generation (Bosselut et al., 2018), text categorization (Liu et al., 2021b), and question answering (Jansen et al., 2014). Explicit discourse relation recognition (EDRR) has already shown that utilizing explicit connective information can effectively determine the types of discourse relations (Varia et al., 2019). In contrast, implicit discourse relation recognition (IDRR) remains challenging because of the absence of connectives. However, previous works have not systematically evaluated the ability of ChatGPT on these two discourse relation recognition tasks. Therefore, in this work, we assess the performance of this large language model (i.e., ChatGPT) on PDTB-style discourse relation recognition (Prasad et al., 2008), dialogue discourse parsing (Asher et al., 2016; Li et al., 2020), and downstream applications of discourse understanding.
We evaluate ChatGPT on discourse relation recognition tasks, including PDTB-style discourse relation recognition, multi-genre crowd-sourced discourse relation recognition, and dialogue discourse parsing.
Detailed Experimental Setting. Explicit discourse relation recognition aims to recognize the discourse relation between two arguments, with explicit discourse markers or connectives (e.g., “so” and “because”) in between.
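A zero-shot query of this kind can be sketched as a simple prompt builder that presents the two arguments and the candidate label set. The exact wording and the restriction to PDTB top-level labels below are assumptions for illustration, not the paper's verbatim template.

```python
# Illustrative zero-shot prompt for explicit discourse relation
# recognition. The label set is the four PDTB 2.0 top-level senses;
# the phrasing of the question is hypothetical.

PDTB_TOP_LEVEL = ["Comparison", "Contingency", "Expansion", "Temporal"]

def build_zero_shot_prompt(arg1: str, arg2: str) -> str:
    """Assemble a single classification query from two arguments."""
    labels = ", ".join(PDTB_TOP_LEVEL)
    return (
        f"Argument 1: {arg1}\n"
        f"Argument 2: {arg2}\n"
        "What is the discourse relation between Argument 1 and "
        f"Argument 2? Choose one of: {labels}."
    )

prompt = build_zero_shot_prompt(
    "The company cut costs sharply,",
    "so profits rose in the next quarter.",
)
```

The explicit connective (“so”) stays inside the argument text, which is precisely the surface cue that makes EDRR comparatively easy for ChatGPT.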
Experimental Results. In Table 3, the performance shows that ChatGPT can recognize each explicit discourse relation by utilizing the information from the explicit discourse connectives. Furthermore, when the prompt template is designed to exploit the label dependence between the top-level and second-level labels, the performance on the top-level classes increases significantly. With the prompt engineering template, as shown in Figure 1, ChatGPT does well on the Contrast, Condition, and Instantiation second-level classes. Appending an input-output example from each discourse relation as the prefix of the prompt template further helps ChatGPT solve this task.
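The label-dependence idea above can be sketched as a prompt that presents each second-level sense grouped under its top-level class, so the model's second-level choice is constrained by the top-level decision. The label hierarchy below follows PDTB 2.0, but the prompt wording is an assumption, not the paper's exact PE template.

```python
# Hypothetical prompt-engineering (PE) template that encodes the
# dependence between top-level and second-level PDTB labels.
# The hierarchy is a subset of PDTB 2.0 senses for illustration.

LABEL_HIERARCHY = {
    "Comparison": ["Concession", "Contrast"],
    "Contingency": ["Cause", "Pragmatic Cause", "Condition"],
    "Expansion": ["Alternative", "Instantiation", "Restatement"],
    "Temporal": ["Asynchronous", "Synchrony"],
}

def build_pe_prompt(arg1: str, arg2: str) -> str:
    """List second-level senses grouped by their top-level class."""
    options = "; ".join(
        f"{top}: {', '.join(subs)}"
        for top, subs in LABEL_HIERARCHY.items()
    )
    return (
        f"Argument 1: {arg1}\n"
        f"Argument 2: {arg2}\n"
        "First decide the top-level discourse relation, then pick a "
        f"second-level sense belonging to it. Options are {options}."
    )
```

Because every second-level sense is shown under exactly one top-level class, a correct second-level answer implies the correct top-level label, which is one plausible reading of the top-level gains reported in Table 3.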
6.1.2 Implicit Discourse Relation Recognition
Experimental Results. The performance in Table 4 demonstrates that implicit discourse relation recognition remains a challenging task for ChatGPT. Even when using the information of label dependence and representative discourse connectives in the in-context learning setting, ChatGPT only achieves 24.54% test accuracy and a 16.20% F1 score on the 11 second-level classes of discourse relations. In particular, ChatGPT performs poorly on second-level classes such as Comp.Concession, Cont.Pragmatic Cause, Exp.Alternative, and Temp.Synchrony. This may be because ChatGPT cannot understand the abstract sense of each discourse relation and the corresponding features of the text. When ChatGPT cannot capture the label sense and linguistic traits, it sometimes responds, "There doesn’t appear to be a clear discourse relation between Argument 1 and Argument 2." or falls back to predicting the Cont.Cause class.
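The ICL setting described above can be sketched as a prompt that prefixes labelled demonstrations, each annotated with a representative implicit connective, before the test pair. The label-to-connective mapping and the demonstration format below are assumptions for the sketch, not the paper's exact prompt.

```python
# Illustrative in-context learning (ICL) prompt for implicit discourse
# relation recognition. Representative connectives per second-level
# sense are hypothetical choices, not taken from the paper.

REPRESENTATIVE_CONNECTIVE = {
    "Comp.Concession": "although",
    "Cont.Cause": "because",
    "Exp.Restatement": "in other words",
    "Temp.Asynchronous": "then",
}

def build_icl_prompt(demos, arg1: str, arg2: str) -> str:
    """Prefix labelled demonstrations, then append the test pair."""
    blocks = []
    for d_arg1, d_arg2, label in demos:
        conn = REPRESENTATIVE_CONNECTIVE[label]
        blocks.append(
            f"Argument 1: {d_arg1}\n"
            f"Argument 2: {d_arg2}\n"
            f"Relation: {label} (implicit connective: {conn})\n"
        )
    blocks.append(
        f"Argument 1: {arg1}\nArgument 2: {arg2}\nRelation:"
    )
    return "\n".join(blocks)
```

Unlike the explicit case, the connective here only appears in the demonstration labels, never in the argument text itself, which is exactly the surface cue ChatGPT is missing at inference time.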