TrustLLM: Trustworthiness in Large Language Models
ABSTRACT
Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Ensuring the trustworthiness of LLMs therefore emerges as an important topic. This paper introduces TRUSTLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, an established benchmark, an evaluation and analysis of trustworthiness for mainstream LLMs, and a discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight dimensions. Based on these principles, we further establish a benchmark across six dimensions: truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study in TRUSTLLM evaluating 16 mainstream LLMs across over 30 datasets. Our findings first show that, in general, trustworthiness and utility (i.e., functional effectiveness) are positively related. For instance, LLMs like GPT-4, ERNIE, and Llama2, which exhibit strong performance in stereotype categorization, tend to reject stereotypical statements more reliably. Similarly, Llama2-70b and GPT-4, known for their proficiency in natural language inference, demonstrate enhanced resilience to adversarial attacks. Second, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Notably, Llama2 demonstrates superior trustworthiness in several tasks, suggesting that open-source models can achieve high levels of trustworthiness without additional mechanisms such as a moderator, offering valuable insights for developers in this field.
Third, it is important to note that some LLMs, such as Llama2, may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Beyond these observations, we have uncovered key insights into the multifaceted trustworthiness of LLMs. In terms of truthfulness, LLMs often struggle to provide truthful responses due to noise, misinformation, or outdated information in their training data. Notably, LLMs enhanced with external knowledge sources show a marked performance improvement. For safety, most open-source LLMs lag significantly behind proprietary LLMs, particularly in areas like jailbreak, toxicity, and misuse. Moreover, the challenge of ensuring safety without over-caution remains. Regarding fairness, most LLMs exhibit unsatisfactory performance in stereotype recognition, with even the best-performing model (GPT-4) achieving an overall accuracy of only 65%. The robustness of LLMs shows significant variability, especially in open-ended tasks and out-of-distribution tasks. Regarding privacy, while LLMs show an awareness of privacy norms, their understanding and handling of private information vary widely, with some models even demonstrating information leakage when tested on the Enron Email Dataset. Lastly, in machine ethics, LLMs exhibit a basic moral understanding but fall short in complex ethical scenarios. These insights underscore the complexity of trustworthiness in LLMs and highlight the need for continued research efforts to enhance their reliability and ethical alignment. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing which specific trustworthiness technologies have been employed is crucial for analyzing their effectiveness.
We advocate establishing an AI alliance among industry, academia, the open-source community, and various practitioners to foster collaboration and advance the trustworthiness of LLMs. Our dataset, code, and toolkit will be available at https://github.com/HowieHwong/TrustLLM, and the leaderboard is released at https://trustllmbenchmark.github.io/TrustLLM-Website/.