Can LLM be a Personalized Judge?
In this paper, we investigate the reliability of LLM-as-a-Personalized-Judge, i.e., asking LLMs to judge user preferences based on personas. Our findings suggest that directly applying LLM-as-a-Personalized-Judge is less reliable than previously assumed, showing low and inconsistent agreement with human ground truth. The personas typically used are often overly simplistic, resulting in low predictive power. To address these issues, we introduce verbal uncertainty estimation into the LLM-as-a-Personalized-Judge pipeline, allowing the model to express low confidence on uncertain judgments. This adjustment leads to much higher agreement (above 80%) on high-certainty samples for binary tasks. Through human evaluation, we find that LLM-as-a-Personalized-Judge achieves performance comparable to that of third-party human evaluators and even surpasses human performance on high-certainty samples.
For instance, while knowing that an individual works as a doctor may offer some insight into that individual, it does not necessarily tell us anything about their preferred types of beverages. Ideally, we would include demographic, behavioral, and contextual factors that are relevant to the specific task at hand. However, defining the relevant set of variables a priori is inherently difficult. Moreover, even when surveys are designed to gather this information, acquiring such detailed data at scale is often impractical and frequently yields incomplete responses. We refer to this challenge as the persona sparsity issue. In practice, this means that in some cases we cannot reasonably infer preferences from the available persona information and should therefore deprioritize such samples. This motivates us to explore verbal uncertainty estimation as a method for filtering out cases where the persona information is insufficient for the LLM judge to make a well-informed judgment.
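To make this filtering step concrete, the following is a minimal sketch of how verbal uncertainty estimation can be wired into the judging pipeline. The prompt wording, the query_llm placeholder, and the 0.8 confidence cutoff are illustrative assumptions rather than our exact implementation: the judge is asked to verbalize a confidence score alongside its preference, and low-confidence samples are filtered out as likely victims of persona sparsity.

```python
import re

# Illustrative cutoff: only judgments verbalized with confidence >= 0.8
# are treated as high-certainty samples (the exact threshold is a choice).
CONFIDENCE_THRESHOLD = 0.8

JUDGE_PROMPT = """You are given a persona and two candidate responses.

Persona: {persona}
Question: {question}
Response A: {response_a}
Response B: {response_b}

Which response would this person prefer? Answer with "A" or "B", then
state your confidence in that judgment as a number between 0 and 1.
Format: <choice>, confidence=<number>"""


def query_llm(prompt: str) -> str:
    """Placeholder for a call to any chat-LLM API; plug in your client here."""
    raise NotImplementedError


def judge_with_uncertainty(persona, question, response_a, response_b):
    """Ask the LLM judge for a preference plus a verbalized confidence score."""
    raw = query_llm(JUDGE_PROMPT.format(
        persona=persona, question=question,
        response_a=response_a, response_b=response_b))
    match = re.search(r"\b([AB])\b.*?confidence\s*=\s*([01](?:\.\d+)?)",
                      raw, re.DOTALL)
    if match is None:
        # An unparseable answer is treated as maximally uncertain.
        return None, 0.0
    return match.group(1), float(match.group(2))


def filter_high_certainty(samples):
    """Keep only samples whose verbalized confidence clears the threshold;
    the rest are deprioritized as cases of persona sparsity."""
    kept = []
    for sample in samples:
        choice, confidence = judge_with_uncertainty(**sample)
        if choice is not None and confidence >= CONFIDENCE_THRESHOLD:
            kept.append({**sample, "choice": choice, "confidence": confidence})
    return kept
```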