LLMs can be Fooled into Labelling a Document as Relevant
Large Language Models (LLMs) are increasingly being used to assess the relevance of information objects. This work reports on experiments to study the labelling of short texts (i.e., passages) for relevance, using multiple open-source and proprietary LLMs. While the overall agreement of some LLMs with human judgements is comparable to human-to-human agreement measured in previous research, LLMs are more likely to label passages as relevant compared to human judges, indicating that LLM labels denoting non-relevance are more reliable than those indicating relevance. This observation prompts us to further examine cases where human judges and LLMs disagree, particularly when the human judge labels the passage as non-relevant and the LLM labels it as relevant. Results show a tendency for many LLMs to label passages that include the original query terms as relevant. We therefore conduct experiments to inject query words into random and irrelevant passages, not unlike the way we inserted the query ‘best café near me’ into this paper. The results demonstrate that LLMs are highly influenced by the presence of query words in the passages under assessment, even if the wider passage has no relevance to the query. This tendency of LLMs to be fooled by the mere presence of query words demonstrates a weakness in our current measures of LLM labelling: relying on overall agreement misses important patterns of failures. There is a real risk of bias in LLM-generated relevance labels and, therefore, a risk of bias in rankers trained on those labels.
Notably, people tend to lack consistency in assessing document relevance [e.g. 3, 26–28]. This is due in part to their exposure to documents of varying levels of relevance during the judgement process, and the order in which these documents are presented. Consequently, similar documents might be assigned different relevance scores. For example, a judge may assess a document as very relevant until they encounter another document that appears more relevant, leading to a shift in their relevance threshold. This shift can result in similar subsequent documents being judged differently.
Relevance labels produced by LLMs are independent of the documents seen previously; i.e., each document is labelled entirely independently of the others. Such labels are also considerably cheaper to collect than judgements from human assessors.
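The two properties discussed above, per-passage independent labelling and the query-word injection probe from the abstract, can be sketched as follows. This is a minimal illustration, not the paper's actual experimental code: the function names (`inject_query_terms`, `relevance_prompt`) and the prompt wording are assumptions introduced here for exposition.

```python
import random

def inject_query_terms(passage: str, query: str, seed: int = 0) -> str:
    """Insert each query word at a random position in an otherwise
    irrelevant passage (a hypothetical recreation of the injection probe)."""
    rng = random.Random(seed)  # seeded for reproducible injections
    words = passage.split()
    for term in query.split():
        words.insert(rng.randrange(len(words) + 1), term)
    return " ".join(words)

def relevance_prompt(query: str, passage: str) -> str:
    """Build a prompt that judges one passage in isolation: it carries
    no memory of previously labelled documents, unlike a human judge."""
    return (
        f"Query: {query}\n"
        f"Passage: {passage}\n"
        "Is the passage relevant to the query? "
        "Answer 'relevant' or 'non-relevant'."
    )

# Example: an irrelevant passage with the abstract's query injected.
irrelevant = "The recipe calls for two cups of flour and a pinch of salt."
probe = inject_query_terms(irrelevant, "best cafe near me", seed=42)
print(relevance_prompt("best cafe near me", probe))
```

Each probe passage would then be sent to the LLM on its own, so any "relevant" label it attracts can only be attributed to the injected query words, not to surrounding context or to judgement drift across documents.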