Style Vectors for Steering Generative Large Language Models
This research explores strategies for steering the output of large language models (LLMs) towards specific styles, such as sentiment, emotion, or writing style, by adding style vectors to the activations of hidden layers during text generation. We show that, in contrast to more complex training-based approaches, style vectors can be computed simply from the layer activations recorded for input texts in a specific style. Through a series of experiments, we demonstrate that activation engineering with such style vectors influences the style of generated text in a nuanced and parameterisable way, which distinguishes it from prompt engineering.
Text generation on a lexical level alone is not sufficient for effective human-AI interaction. Beyond that, a cognitive AI assistant should also be able to adapt to the human user on an affective and emotional level regarding engagement, regulation, decision-making, and discovery.
Using datasets of text samples labeled with sentiment and emotion categories, we show that one can derive a vector representation of a desired style class (e.g., positive sentiment) that, when added to the activations of certain layers of an LLM (in this work LLaMa (Touvron et al., 2023)), causes the output to show characteristics of this style class (see Fig. 1). Our experiments show that the effect of the modified models is more salient when they are prompted with subjective input (e.g., “How do you define art?”) than with factual input that allows few degrees of freedom (e.g., “What is the world’s longest river?”).
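The idea of deriving a style vector from labeled samples and adding it to a layer activation can be illustrated with a small sketch. It rests on assumptions that are not stated in the text above: the style vector is taken to be the mean activation of the target-style samples minus the mean activation of all other samples, the scaling factor is called lam, and the three-dimensional "activations" are made-up numbers rather than values recorded from an LLM. The function names are hypothetical.

```python
# Toy sketch (assumed formulation): style vector = mean activation of
# target-style samples minus mean activation of all other samples.
# The numbers below are invented; real activations come from an LLM layer.

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def style_vector(activations, labels, target_style):
    """Difference between in-style and out-of-style mean activations."""
    in_style = [a for a, y in zip(activations, labels) if y == target_style]
    others = [a for a, y in zip(activations, labels) if y != target_style]
    if not in_style or not others:
        raise ValueError("need samples both inside and outside the target style")
    mu_in, mu_out = mean_vector(in_style), mean_vector(others)
    return [p - q for p, q in zip(mu_in, mu_out)]

def steer(activation, vector, lam=1.0):
    """Add the style vector, scaled by lam, to a single layer activation."""
    return [a + lam * v for a, v in zip(activation, vector)]

# Four labeled texts with made-up 3-dimensional activations.
activations = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0], [0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
labels = ["positive", "positive", "negative", "negative"]

v_pos = style_vector(activations, labels, "positive")
v_neg = style_vector(activations, labels, "negative")

base = [0.5, 0.5, 0.5]  # activation of some new prompt (invented)
steered_pos = steer(base, v_pos, lam=1.0)
steered_neg = steer(base, v_neg, lam=1.0)
```

In this toy setting with only two classes, the negative style vector is the exact opposite of the positive one, so the two steered activations move away from the base in opposite directions. In a real model the addition would happen inside the forward pass of the chosen layers.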
Moreover, layer activations have demonstrated utility in steering LLMs. Turner et al. (2023) show that steering vectors, derived from contrasting activations for semantically opposed inputs like “love” and “hate”, can guide LLM outputs during sentence completion. The difference in activations from such contrasting prompts at layer i can be added directly to another input’s activations to steer outputs.
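Written out, the contrastive construction described above might look as follows. The notation is an assumption made for this sketch and is not taken from Turner et al. (2023): h_i(x) denotes the activations of input x at layer i, and c is a scaling coefficient. How the token positions of the contrast prompts are aligned with those of the steered input is left open here.

```latex
% Assumed notation: h_i(x) = activations of input x at layer i,
% c = scaling coefficient chosen by the user.
\begin{align}
  v_i &= h_i(\text{``love''}) - h_i(\text{``hate''}) \\
  \tilde{h}_i(x) &= h_i(x) + c \, v_i
\end{align}
```

With c > 0 the output is pushed towards the first prompt of the pair, and with c < 0 towards the second.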
Yelp Review Dataset. The dataset (Shen et al., 2017) contains unpaired restaurant reviews from the Yelp platform, each labeled as positive or negative. After dropping duplicates, the dataset contains 542k samples.
GoEmotions. As a multi-class style dataset, the GoEmotions dataset (Demszky et al., 2020) comprises 58k manually curated user comments from the internet platform Reddit, labeled with 27 emotional categories. We use 5k samples that can be unambiguously mapped to the established six basic emotion categories (Ekman, 1992): sadness, joy, fear, anger, surprise, and disgust.
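One way to read "unambiguously mapped" is sketched below: a sample is kept only if every one of its fine-grained labels maps to the same basic emotion. This reading is an assumption, and the grouping shown is a partial, illustrative one rather than the mapping actually used in this work. The names EKMAN_GROUPS and basic_emotion are hypothetical.

```python
# Partial, illustrative grouping of fine-grained labels into basic emotions.
# The exact mapping used for the 5k-sample subset is not given in the text.
EKMAN_GROUPS = {
    "anger": {"anger", "annoyance"},
    "disgust": {"disgust"},
    "fear": {"fear", "nervousness"},
    "joy": {"joy", "amusement", "excitement"},
    "sadness": {"sadness", "grief", "disappointment"},
    "surprise": {"surprise"},
}
FINE_TO_BASIC = {
    fine: basic for basic, fines in EKMAN_GROUPS.items() for fine in fines
}

def basic_emotion(fine_labels):
    """Return the single basic emotion of a sample, or None if the sample
    is unlabeled, has an unmapped label, or maps to several basic emotions."""
    if not fine_labels or any(l not in FINE_TO_BASIC for l in fine_labels):
        return None
    basics = {FINE_TO_BASIC[l] for l in fine_labels}
    return basics.pop() if len(basics) == 1 else None

# Invented example comments with fine-grained labels.
samples = [
    ("I can't stop laughing", ["amusement", "joy"]),
    ("That noise scared me", ["fear"]),
    ("Happy but also nervous", ["joy", "nervousness"]),
    ("Thanks for the info", ["gratitude"]),
]

kept = []
for text, fine_labels in samples:
    emotion = basic_emotion(fine_labels)
    if emotion is not None:
        kept.append((text, emotion))
```

The third sample is dropped because its labels fall into two basic emotions, and the fourth because its label is not covered by the partial grouping.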
For the evaluation of the training-based style vectors, we only incorporate steering vectors that reproduce the target sentence with loss < 5, as vectors with higher loss tend to yield grammatically incorrect output sentences. This resulted in 470 vectors per layer for the Yelp review dataset, 89 for GoEmotions, and 491 for the Shakespeare dataset. In a pre-study on a smaller subset of the data, we found that the steering vectors for the layers i ∈ {18, 19, 20} are most effective, which is supported by the findings of our probing study (Sec. 4.3). We train steering vectors only for these layers to keep the computational effort feasible. Even with this restriction, the experiments ran for 150 hours each on the Yelp and Shakespeare datasets and for around 100 hours on GoEmotions.
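The selection rule above (keep a trained steering vector only if its reconstruction loss is below 5, and only for layers 18 to 20) can be sketched as a simple filter. The record layout with the keys "layer", "loss", and "vector", the function name select_vectors, and the example values are assumptions made for illustration.

```python
from collections import defaultdict

LOSS_THRESHOLD = 5.0         # vectors with loss >= 5 are discarded
TARGET_LAYERS = {18, 19, 20}  # layers for which vectors are trained

def select_vectors(records, threshold=LOSS_THRESHOLD, layers=TARGET_LAYERS):
    """Group the vectors that pass the loss filter by their layer index."""
    by_layer = defaultdict(list)
    for rec in records:
        if rec["layer"] in layers and rec["loss"] < threshold:
            by_layer[rec["layer"]].append(rec["vector"])
    return dict(by_layer)

# Invented training results: one record per trained steering vector.
records = [
    {"layer": 18, "loss": 1.2, "vector": [0.1, 0.2]},
    {"layer": 18, "loss": 7.5, "vector": [0.9, 0.9]},  # loss too high
    {"layer": 19, "loss": 4.9, "vector": [0.3, 0.1]},
    {"layer": 20, "loss": 5.0, "vector": [0.4, 0.4]},  # not strictly below 5
    {"layer": 12, "loss": 0.5, "vector": [0.0, 0.1]},  # layer not selected
]

selected = select_vectors(records)
```

The comparison is strict, matching the "loss < 5" condition, so a vector with a loss of exactly 5 is excluded.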
E1 - Opposite sentiment directions. Examples of how the output changes when shifted in opposite directions using activation-based style vectors derived from the Yelp dataset at λ = 1. The target style is positive for O1 and negative for O2.
P: Share a personal anecdote about a vacation you had
O1: My favorite vacation was a trip I took to the beach with my family. We had a great time exploring the beach, going on walks, and tasting all the delicious food. The best part was the relaxing atmosphere and the great memories we made.
O2: I had a terrible experience with a vacation I had. I went to a beach in a different country and I was expecting to have a great time. Instead, I was met with a huge amount of racism. People were openly saying that I was a “strikeo- bog” (a term that is used to refer to a black person) and that I was “not allowed” to be there. I was then physically attacked. [...]
E5 - Steering style beyond sentiment and emotions. These outputs were shifted with activation-based style vectors computed from the Shakespeare dataset.
Happiness is a state of contentment and joy, wherein the soul is freed from the bondage of sorrow, mischievous fancies, and unworthy thoughts, and wherein the body is freed from the bondage of pain, and wherein the soul duly commends itself to the Lord, and is in some measure made partaker of the blessedness which is past, which is present, or which to come