Rethinking STS and NLI in Large Language Models


Recent years have seen the rise of large language models (LLMs), which practitioners typically apply through task-specific prompts; this approach has proven effective for a wide variety of tasks. However, when applied to semantic textual similarity (STS) and natural language inference (NLI), the effectiveness of LLMs turns out to be limited by low accuracy in low-resource domains, model over-confidence, and difficulty in capturing disagreement among human judgements. With this in mind, we revisit STS and NLI in the era of LLMs. We first evaluate STS and NLI performance in the clinical/biomedical domain, and then assess LLMs' predictive confidence and their ability to capture collective human opinions. We find that these long-standing problems have yet to be properly addressed in the era of LLMs.

Semantic textual similarity (STS) is a fundamental natural language understanding (NLU) task that involves predicting the degree of semantic equivalence between two pieces of text (Cer et al., 2017). Under the regime of first pre-training a language model and then fine-tuning it with labelled examples, STS modelling faces three major challenges (see examples in Table 1): (i) low accuracy in low-resource, knowledge-rich domains due to exposure bias (Wang et al., 2020b,c); (ii) over-confident incorrect predictions; such unreliable estimates are dangerous and may lead to catastrophic errors in safety-critical applications like clinical decision support (Wang et al., 2022b); and (iii) difficulty in capturing collective human opinions on individual examples (Wang et al., 2022b). Natural language inference (NLI), where the goal is to determine whether a hypothesis sentence is entailed by, contradicted by, or neutral with respect to a premise, faces similar issues.
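For concreteness, the sketch below spells out the two task formats; the sentences, gold score, and label are hypothetical placeholders rather than items from the datasets discussed here.

```python
# Illustrative (hypothetical) task formats; not examples from the paper's datasets.

# STS: predict a graded similarity score for a sentence pair,
# typically on a 0-5 scale (Cer et al., 2017).
sts_example = {
    "sentence1": "The patient was given aspirin for chest pain.",
    "sentence2": "Aspirin was administered to the patient for chest pain.",
    "score": 4.5,  # hypothetical gold similarity score
}

# NLI: three-way classification of a premise-hypothesis pair.
nli_example = {
    "premise": "The patient was given aspirin for chest pain.",
    "hypothesis": "The patient received no medication.",
    "label": "contradiction",  # one of {entailment, neutral, contradiction}
}
```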

We ask the following questions: (i) How well do LLMs perform in knowledge-rich, low-resource domains, such as biomedical and clinical STS/NLI? (ii) Does the paradigm of prompting LLMs lead to over-confident predictions? (iii) How can collective human opinion (the distribution of human judgements) be captured using LLMs?

Low accuracy in low-resource domains. In domains such as biomedicine and clinical text, domain experts (e.g., physicians or clinicians) are required in the annotation process to ensure data quality, which leads to an extremely limited amount of labelled data (fewer than 2,000 examples in clinical/biomedical STS datasets).

Capturing the distribution of human opinions with large neural models is non-trivial, especially for continuous values. Applying Bayesian estimation to all model parameters of a large language model is theoretically possible, but in practice it is prohibitively expensive in both training and evaluation. Deriving uncertainty estimates by integrating over millions of model parameters, and initialising a prior distribution for each, are both non-trivial (Wang et al., 2022a).

Rather than estimating the key parameters of a standard distribution (e.g., μ and σ of a Gaussian) to fit collective human opinions, in this work we propose eliciting personalised ratings that simulate individual annotations, and then comparing the two resulting collective distributions. Specifically, we prompt LLMs by setting the system role to different personas characterised by age, gender, educational background, profession and other skills. The assumption is that LLMs can make persona-specific judgements within the capability and background of the assigned role.
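A minimal sketch of this persona-conditioned prompting is shown below, assuming the OpenAI chat completions API; the persona descriptions, model name, and prompt wording are illustrative assumptions rather than the exact prompts used in this work.

```python
# Sketch: elicit persona-specific STS ratings by setting the system role.
from openai import OpenAI

client = OpenAI()

PERSONAS = [
    "You are a five-year-old child.",
    "You are a helpful assistant.",
    "You are an NLP PhD student.",
    "You are a linguistics expert.",
]

def persona_sts_rating(persona: str, sent1: str, sent2: str) -> str:
    """Ask the model, under a given persona, to rate similarity on a 0-5 scale."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption: any chat model accepting a system role
        messages=[
            {"role": "system", "content": persona},
            {
                "role": "user",
                "content": (
                    "On a scale from 0 (completely dissimilar) to 5 (equivalent), "
                    f"how similar are these sentences?\nA: {sent1}\nB: {sent2}\n"
                    "Answer with a single number."
                ),
            },
        ],
        temperature=1.0,  # keep sampling noise so repeated calls act like different annotators
    )
    return response.choices[0].message.content

# One rating per persona yields simulated individual annotations whose
# distribution can then be compared with the crowd-sourced one.
ratings = [persona_sts_rating(p, "Sentence A.", "Sentence B.") for p in PERSONAS]
```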

Hypothesis: If language models are capable of performing personalised assignments that match the abilities of different roles, then on complex semantic reasoning tasks a helpful assistant should give more accurate estimates than a five-year-old child, a linguistics expert should be better than an assistant, and an NLP PhD student should have judgement comparable to that of an NLP expert. Judgements collected from different roles should then be close to the distribution of collective human opinions gathered by crowdsourcing.

Therefore, we re-run the experiments ten times on ChaosNLI and USTS-C with the roles of an NLP PhD student and a linguistics expert, respectively. We can see in Table 14 that, on both ChaosNLI and USTS-C, the results deviate significantly across the ten runs; higher performance cannot be maintained consistently.

This suggests that model uncertainty may contribute more to the performance variance than the choice of system role does.
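For illustration, the sketch below shows one way such run-to-run variance against the human label distribution could be quantified; the soft-label vectors and the number of runs shown are hypothetical placeholders, and Jensen-Shannon divergence is assumed as the comparison metric for ChaosNLI-style soft labels.

```python
# Sketch: quantify run-to-run variance against a crowd label distribution.
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical crowd distribution over {entailment, neutral, contradiction} for one item.
human_dist = np.array([0.6, 0.3, 0.1])

# Hypothetical model label distributions from repeated runs under the same persona.
runs = np.array([
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.8, 0.1, 0.1],
])

# Jensen-Shannon divergence (squared JS distance) between each run and the human distribution.
jsd_per_run = np.array([jensenshannon(r, human_dist, base=2) ** 2 for r in runs])
print(f"JSD mean={jsd_per_run.mean():.3f}, std={jsd_per_run.std():.3f}")
# A large standard deviation across runs indicates that model (sampling) uncertainty,
# rather than the assigned system role, drives much of the variance.
```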