RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models

Paper · arXiv 2310.00746 · Published October 1, 2023

However, the closed-source nature of state-of-the-art LLMs and their general-purpose training limit role-playing optimization. In this paper, we introduce RoleLLM, a framework to benchmark, elicit, and enhance role-playing abilities in LLMs. RoleLLM comprises four stages: (1) Role Profile Construction for 100 roles; (2) Context-Based Instruction Generation (Context-Instruct) for role-specific knowledge extraction; (3) Role Prompting using GPT (RoleGPT) for speaking style imitation; and (4) Role-Conditioned Instruction Tuning (RoCIT) for fine-tuning open-source models along with role customization. By Context-Instruct and RoleGPT, we create RoleBench, the first systematic and fine-grained character-level benchmark dataset for role-playing with 168,093 samples.

To mitigate these issues, several methods have been previously proposed for both closed-source and open-source models (Li et al., 2023b; Park et al., 2023; Chen et al., 2023a; Salemi et al., 2023; Wei et al., 2023). Nevertheless, they have the following limitations: (1) limited granularity: they mainly focus on coarse-grained personality traits, professions, or personas (Li et al., 2023b; Park et al., 2023; Wei et al., 2023; Chen et al., 2023a) (e.g., programmer, writer), neglecting more complex, finer-grained role-playing at the character level (e.g., Sherlock Holmes) for nuanced interactions and enriched experiences; (2) lack of data and benchmark: there is a lack of high-quality, diverse, and extensive open-source datasets, as well as a shortage of benchmarks for evaluation; (3) API and context costs: methods relying on closedsource models such as ChatGPT and GPT-4 (OpenAI, 2023) cannot be freely fine-tuned and hence require all supplementary information to be included in the prompt, unnecessarily occupying the context window. Besides, API costs are prohibitively high.

In this paper, we introduce RoleLLM, a roleplaying framework of data construction, evaluation, and solutions for both closed-source and opensource models. In Figure 1, RoleLLM includes four key stages: (1) Role Profile Construction: we construct profiles for 95 English and 5 Chinese roles at a fine-grained character level with diverse personalities, selected from 916 English and 24 Chinese scripts; (2) Context-Based Instruction Generation (Context-Instruct): we use GPT to generate high-quality QA pairs from segmented profiles to extract role-specific knowledge; (3) Role Prompting using GPT (RoleGPT): we elicit role-playing abilities in GPT via dialogue-engineering- based role prompting, utilizing system instruction and retrieval augmentation, to generate responses for speaking style imitation; and (4) Role-Conditioned Instruction Tuning (RoCIT): by fine-tuning open-source LLaMA (Touvron et al., 2023) and ChatGLM23 (Du et al., 2022; Zeng et al., 2022) with context-efficient role conditioning on RoleBench with 168,093 role-playing samples generated by Context-Instruct and RoleGPT, we obtain RoleLLaMA and RoleGLM. Note that, to the best of our knowledge, RoleBench is the first systematic instruction-tuning dataset and benchmark for fine-grained role-playing.

System Instruction: You are {role_name}, your description is: {role_description_and_catchphrases}. Now please answer some questions to accurately show your personality traits! Your speaking style should fully imitate the personality role assigned to you! Please do not expose that you are an artificial intelligence model or a language model, you must always remember that you are only assigned one personality role. Don't be verbose or too formal or polite when speaking.

User Prompt: {user_name}: ``{user_instruction}''