Can Large Language Models Capture Human Annotator Disagreements?
Human annotation variation (i.e., annotation disagreements) is common in NLP and often carries important information, such as task subjectivity and sample ambiguity. While Large Language Models (LLMs) are increasingly used for automatic annotation to reduce human effort, their evaluation typically focuses on predicting the majority-voted “ground truth” labels. It remains unclear, however, whether these models also capture informative human annotation variation. Our work addresses this gap by extensively evaluating LLMs’ ability to predict annotation disagreements without access to repeated human labels. Our results show that LLMs struggle to model disagreements, a failure that majority-label-based evaluations overlook. Notably, while RLVR-style (Reinforcement Learning with Verifiable Rewards) reasoning generally boosts LLM performance, it degrades performance on disagreement prediction. Our findings highlight the critical need to evaluate and improve LLM annotators in disagreement modeling.
Typically, the performance of these LLM annotators is evaluated against a majority label or their agreement with human annotators.
Therefore, we identify the following practice-evaluation gap:
While LLM annotators are widely studied and deployed, there is no evaluation of whether they can capture informative human disagreements.
Such evaluation is particularly important for LLMs optimized on tasks with single deterministic answers (e.g., via reinforcement learning with verifiable rewards, RLVR), which contrasts with the reality that many annotation tasks involve multiple valid perspectives.
In other words, rather than measuring whether LLMs can reproduce the majority opinion, we want to know whether they can reproduce the distribution over human answers.
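To make this concrete, the following minimal sketch compares a model-predicted label distribution against the empirical distribution of repeated human labels for one item. The Jensen-Shannon distance and the example counts are illustrative assumptions, not necessarily the metric or data used in the paper:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def human_label_distribution(labels, num_classes):
    """Convert repeated human labels for one item into an empirical distribution."""
    counts = np.bincount(labels, minlength=num_classes)
    return counts / counts.sum()

# Hypothetical item: 10 annotators, 3 classes, with genuinely split opinions.
human = human_label_distribution([0, 0, 0, 1, 1, 1, 1, 2, 2, 2], num_classes=3)

# A model that concentrates on the majority label matches the majority-vote
# "ground truth" but misses the disagreement entirely.
model = np.array([0.05, 0.90, 0.05])

# Jensen-Shannon distance (base 2): 0 = identical distributions, 1 = disjoint.
print(f"JS distance: {jensenshannon(human, model, base=2):.3f}")
```

A model scored only on majority-label accuracy would look perfect here, while the distributional comparison exposes how much human variation it fails to capture.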
We find that RLVR-style reasoning significantly harms disagreement prediction when human annotation variance is high. Moreover, forcing additional reasoning effort (Muennighoff et al., 2025) does not improve the performance of RLVR LLMs. In contrast, for RLHF LLMs, Chain-of-Thought (CoT; Wei et al., 2023) reasoning significantly improves disagreement prediction. Furthermore, RLVR LLMs perform better with a deterministic goal (e.g., predicting the majority annotation) than with a probabilistic one (e.g., predicting the proportion of human annotators choosing each label). Our findings suggest that using LLM annotators, especially RLVR LLMs on subjective tasks, requires extra caution, as these models may overlook critical human disagreements.
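As a rough illustration of these two elicitation goals, here are hypothetical prompt templates (the paper's exact prompts are not reproduced here): one asks for a single majority label, the other for a verbalized distribution over labels.

```python
item = "The movie was fine, I guess."
labels = ["negative", "neutral", "positive"]

# Deterministic goal: elicit the single majority label.
majority_prompt = (
    f"Label the sentiment of the text with one of {labels}.\n"
    f"Text: {item}\n"
    f"Answer with a single label."
)

# Probabilistic goal: elicit the proportion of annotators per label.
distribution_prompt = (
    f"Imagine 10 human annotators labeling the sentiment of the text "
    f"with one of {labels}.\n"
    f"Text: {item}\n"
    f"For each label, estimate the fraction of annotators who would choose it; "
    f"the fractions should sum to 1."
)
```

The finding that RLVR LLMs do better under the first formulation suggests that, when disagreement information matters, the choice of elicitation format is not neutral.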