From Text to Emoji: How PEFT-Driven Personality Manipulation Unleashes the Emoji Potential in LLMs
We employed Opinion-QA-based Parameter-Efficient Fine-Tuning (PEFT), specifically Quantized Low-Rank Adaptation (QLoRA), to manipulate the Big Five personality traits: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. After PEFT, models such as Mistral-7B-Instruct and LLaMA-2-7B-Chat began generating emojis, even though no emojis were present in the PEFT data.
Explainability analysis suggested that the LLMs used emojis intentionally to express these traits. Mechanistic Interpretability analysis showed that this latent behaviour could be traced to specific neurons that became activated or amplified after PEFT.
We present a new Opinion QA dataset and methodologies for systematically adjusting personality traits in LLMs. Utilising Quantized Low-Rank Adaptation (QLoRA), a method within Parameter-Efficient Fine-Tuning (PEFT) (Dettmers et al., 2023a), we demonstrate that LLMs can achieve more consistent and enduring personality expressions.
A sharp increase in the activation of certain neurons in LLaMA-2-7B-Chat and Mistral-7B-Instruct post-PEFT suggests that these neurons became specialised for recognising trait-specific expressions, such as those of Neuroticism and Extraversion, which facilitated spontaneous emoji generation. Our findings suggest this phenomenon represents a novel mode of expression linked to specific personality traits (Figure 1), introducing a new dimension of LLM communication that integrates verbal and visual elements. This enhances user engagement, improves emotional expressiveness in digital assistants, and enables more personalised user experiences in areas such as mental health, education, and customer service.
QLoRA is a Parameter-Efficient Fine-Tuning method that reduces memory requirements by freezing the original pre-trained model weights and introducing trainable low-rank matrices into the model.
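A minimal sketch of this setup, assuming the Hugging Face transformers, peft, and bitsandbytes libraries; the rank, scaling factor, and target modules shown are illustrative hyperparameters, not our exact configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base weights to 4-bit (NF4), as in the QLoRA paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", quantization_config=bnb_config
)

# Attach small trainable low-rank adapters; the base model stays frozen.
lora_config = LoraConfig(
    r=16,                                  # rank of the update matrices
    lora_alpha=32,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the adapters receive gradients
```

Because only the low-rank adapter matrices receive gradients while the 4-bit-quantized base weights stay fixed, fine-tuning a 7B model becomes feasible on a single GPU.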
This study adds Openness and Conscientiousness to the original traits of Extraversion, Agreeableness, and Neuroticism. While Mao et al. (2023) excluded these dimensions, considering them similar to Agreeableness in generating viewpoints, we argue their inclusion is vital for a comprehensive analysis and understanding of trait influence on opinions. Adding Openness and Conscientiousness allows us to capture additional aspects of personality, such as intellectual curiosity, preference for novelty, and diligent behaviour, which are not fully encompassed by Agreeableness alone. The dataset contains 5000 instances of opinion texts generated by a GPT-3.5-based model, split 80:20 into training (4000 instances, 800 per trait) and test (1000 instances, 200 per trait) sets. Instances were created using structured prompts to elicit specific traits, enabling a nuanced analysis of personality expression. The generated text was analysed using word clouds and text analysis to identify key linguistic patterns and thematic elements, and to ensure lexical diversity associated with the Big Five personality traits, as detailed in Appendix A.1. Additionally, the text was manually verified to ensure alignment with the intended traits, providing a robust representation of trait-specific language usage.
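For illustration, a hypothetical sketch of how such trait-conditioned opinions could be generated with the OpenAI Python client; the prompt template, model name, and sampling parameters are assumptions, not our exact templates:

```python
from openai import OpenAI

# Hypothetical trait-conditioned generation; prompt wording is illustrative.
client = OpenAI()

TRAITS = ["Openness", "Conscientiousness", "Extraversion",
          "Agreeableness", "Neuroticism"]

def generate_opinion(question: str, trait: str) -> str:
    prompt = (
        f"Answer the following opinion question in the voice of a person "
        f"high in {trait}, expressing the trait through word choice and "
        f"tone. Do not name the trait.\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,
    )
    return resp.choices[0].message.content
```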
"I believe the First Indochina War had its consequences, paving the way for the withdrawal of French colonial forces. However, there were many factors at play, and it’s important to acknowledge the contributions of everyone involved." certain words play a pivotal role in predicting the Agreeableness trait as observed in Figures 51 and 52. Terms such as "contributions", "acknowledge", and "believe" have a strong positive contribution to the prediction of Agreeableness, as they suggest inclusiveness, recognition, and a conciliatory tone, which are characteristic of Agreeableness. On the other hand, words like "consequences" contribute negatively to the Agreeableness prediction because it often connotes conflict, repercussions, or negative outcomes, whereas Agreeableness is characterized by cooperation, empathy, and a focus on harmony and positive social interactions (Liu and Sun, 2020).
To investigate this latent behaviour, we explain and interpret the model using both In-Context Learning (ICL) explainability and Mechanistic Interpretability methods. Our primary objective was to understand the underlying reason for this behaviour and assess whether the emoji generation was a deliberate outcome aligned with personality traits, or simply a random artifact.
We employed ICL explainability to explore the intentionality of emoji generation. Through prompting, we asked the model to produce the top five tokens that best represented the personality traits inferred from the generated text. From these tokens, we identified the 50 most frequent across the dataset, focusing on emojis to manually verify their relevance to the personality traits (see Appendix A.3). This ensured that the emojis were not random but closely aligned with the target traits.
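A minimal sketch of this elicitation-and-tally step; the prompt wording and data layout are illustrative assumptions, not the exact ones from our pipeline:

```python
from collections import Counter

# Illustrative paraphrase of the elicitation instruction, not the exact wording.
ELICIT_PROMPT = (
    "List the top five tokens (words or emojis) that best represent the "
    "personality trait expressed in the following text:\n\n{text}"
)

def most_frequent_tokens(per_response_top5: list[list[str]], k: int = 50):
    """Tally the model's top-5 tokens per response; return the k most frequent."""
    counts = Counter(t for top5 in per_response_top5 for t in top5)
    return counts.most_common(k)

# e.g. most_frequent_tokens([["😊", "warm", "kind", "🤝", "agree"], ...])
```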
Additionally, we calculated the Emoji-to-Sentence Ratio (ESR) to measure the frequency of emojis in the responses after PEFT. The ESR was defined as
$$\mathrm{ESR} = \frac{\#\ \text{sentences with emojis}}{\#\ \text{total sentences}}.$$
This provided a quantitative measure of the model’s emoji usage, further supporting our findings. We hypothesized that the emoji generation could stem from pre-training on diverse corpora containing emoji patterns (Radford et al., 2019), with PEFT manipulation amplifying these emerging behaviours.
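A minimal sketch of computing the ESR, assuming a simple punctuation-based sentence splitter and a coarse Unicode emoji pattern (both simplifications):

```python
import re

# Rough matcher over common Unicode emoji blocks; simplified, not exhaustive.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]"
)

def emoji_to_sentence_ratio(text: str) -> float:
    """ESR = (# sentences containing at least one emoji) / (# sentences)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    with_emoji = sum(1 for s in sentences if EMOJI_RE.search(s))
    return with_emoji / len(sentences)

# 2 of 3 sentences contain an emoji -> ESR = 0.667
print(emoji_to_sentence_ratio("I love this 😄. It works well. Amazing results 🎉!"))
```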
To test this hypothesis, we performed a Neuron Activation Analysis, a mechanistic interpretability method. This analysis focused on neuron activations in the deepest transformer layer, just before token generation, using conversational and informal prompts containing emojis.
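A minimal sketch of this activation-capture step, assuming a Hugging Face LLaMA-style model whose deepest decoder layer is reachable as model.model.layers[-1]; the prompt is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model.eval()

captured = {}

def hook(module, inputs, output):
    # The decoder layer returns a tuple; output[0] holds the hidden states
    # with shape (batch, seq_len, hidden_size).
    captured["acts"] = output[0].detach()

# Register the hook on the deepest decoder layer, just before generation.
handle = model.model.layers[-1].register_forward_hook(hook)
with torch.no_grad():
    model(**tok("I totally agree with you 😊", return_tensors="pt"))
handle.remove()

last_token_acts = captured["acts"][0, -1]   # activations at the final position
print(last_token_acts.topk(10).indices)     # the 10 most active hidden units
```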
To investigate whether different emojis activate distinct neurons, we conducted additional experiments using the same neutral sentence while systematically varying the emojis to correspond with specific target personality traits: for example, pairing the sentence with an emoji associated with Agreeableness and one associated with Neuroticism, to observe whether the activations differed based on the emoji used. Additionally, trait-specific prompts were employed to determine whether these textual cues triggered different neuron activations. In total, we used 17 prompts (as in Appendices A.6 and A.8) with varying emojis and texts to explore the effect of both emoji type and textual prompt on neuron activation. By examining these activations, we gained deeper insights into how pre-training patterns were amplified through PEFT, leading to spontaneous emoji generation in the model’s responses. To further support our claim that PEFT amplifies latent behaviours in LLMs, we calculated the probability of a given emoji being generated as the next token.
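A minimal sketch of this next-token probability computation; the prompt and emoji shown are illustrative, and emojis that tokenise into multiple sub-tokens are approximated here by their first sub-token:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model.eval()

def next_token_prob(prompt: str, target: str) -> float:
    """Probability that `target` is generated as the very next token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]            # logits over the vocabulary
    probs = torch.softmax(logits, dim=-1)
    # Simplification: multi-token emojis are reduced to their first sub-token.
    target_id = tok(target, add_special_tokens=False).input_ids[0]
    return probs[target_id].item()

# Re-run with the PEFT-adapted model to compare base vs. post-PEFT probabilities.
print(next_token_prob("I am so excited about this!", "😄"))
```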
As observed in Table 1, LLaMA-2-7B-Chat predominantly produced emojis for Extraversion and Neuroticism, with Extraversion showing the highest emoji-to-sentence ratio at 0.995, where nearly every sentence included an emoji.
Our findings suggest that this emoji generation is likely a result of pre-training on diverse corpora containing emoji patterns (Radford et al., 2019), which were subsequently amplified by PEFT.