CONSCENDI: A Contrastive and Scenario-Guided Distillation Approach to Guardrail Models for Virtual Assistants
We use CONSCENDI to exhaustively generate training data with two key LLM-powered components: scenario-augmented generation and contrastive training examples. Before generating conversational data, we first generate a set of rule-breaking scenarios that enumerate diverse, high-level ways a rule can be violated. This scenario-guided approach produces a diverse training set and gives chatbot designers greater control. To generate contrastive examples, we prompt the LLM to alter rule-violating conversations into acceptable ones, enabling fine-grained distinctions. We then use the data generated by CONSCENDI to train a smaller model.
….
Customized, domain-specific guardrails often consist of manually engineered LLM prompts. Yet constructing a prompt robust to all rule-breaking behavior is difficult with instructions and in-context examples alone. For example, a rule prohibiting an agent from stating political opinions can guard against generating controversial text, but defining the intricacies of such a rule is challenging: are widely accepted statements acceptable, while more sectarian statements are out of bounds?
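To make the prompt-based baseline concrete, the following is a minimal sketch of such a guardrail check. The rule text, prompt template, and `call_llm` function are illustrative assumptions, not the actual prompts used in this work; `call_llm` stands in for a chat-completion API and is stubbed with a trivial keyword heuristic purely so the sketch runs.

```python
# Minimal sketch of a prompt-based guardrail: the rule and conversation are
# placed into a template, and an LLM is asked whether the rule was violated.

RULE = "Do not state political opinions."

GUARDRAIL_PROMPT = """You are a guardrail model. Rule: {rule}

Conversation:
{conversation}

Does the agent's last message violate the rule? Answer VIOLATION or OK."""


def call_llm(prompt: str) -> str:
    # Placeholder: in practice this would query an LLM such as GPT-4.
    # Stubbed with a keyword heuristic so the example is self-contained.
    return "VIOLATION" if "I believe the party" in prompt else "OK"


def check(conversation: str) -> bool:
    """Return True if the guardrail flags a rule violation."""
    prompt = GUARDRAIL_PROMPT.format(rule=RULE, conversation=conversation)
    return call_llm(prompt).strip() == "VIOLATION"
```

The brittleness discussed above lives entirely in the prompt wording: edge cases such as widely accepted versus sectarian statements must all be anticipated in the template text.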
….
Therefore, we propose a multi-stage data generation pipeline to ensure GPT-4 produces a broad, domain-specific dataset. We begin by prompting an LLM to generate a variety of scenarios that illustrate different ways a dialog agent might break each given rule. Scenarios can be added to or removed from this set according to the engineer's preferences, providing a granular level of control. Next, we use GPT-4 to simulate a conversation between a user and a dialog agent that violates the rule according to the provided scenario. This scenario-guided data generation method yields a more diverse set of examples than directly generating conversations.
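The two-stage loop above can be sketched as follows. The prompt wording and the `generate` function are assumptions rather than the paper's exact prompts; `generate` is a placeholder for a GPT-4 call, stubbed with canned text so the sketch is runnable.

```python
# Stage 1: enumerate rule-breaking scenarios.
# Stage 2: for each scenario, simulate a violating conversation.

def generate(prompt: str) -> str:
    # Placeholder for an LLM call; returns canned text so the sketch runs.
    if "List" in prompt:
        return "1. Agent endorses a candidate\n2. Agent mocks a policy"
    return "User: ...\nAgent: ..."


def scenarios_for_rule(rule: str) -> list[str]:
    """Stage 1: enumerate high-level ways the rule can be violated."""
    raw = generate(
        f"List distinct scenarios in which a dialog agent could violate this rule: {rule}"
    )
    # Parse a numbered list of the form "1. <scenario>".
    return [line.split(". ", 1)[1] for line in raw.splitlines() if ". " in line]


def violating_conversation(rule: str, scenario: str) -> str:
    """Stage 2: simulate a user-agent conversation that violates the rule via this scenario."""
    return generate(f"Write a conversation where the agent violates '{rule}' by: {scenario}")


rule = "Do not state political opinions."
# Each training example is tied to the scenario that produced it, which is
# what lets engineers add or drop scenarios to control dataset coverage.
dataset = [(s, violating_conversation(rule, s)) for s in scenarios_for_rule(rule)]
```

Because every conversation is keyed to a scenario, removing a scenario from the set removes its examples from the training data, giving the granular control described above.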
Furthermore, we employ a contrastive approach to generate non-violating conversations that are alterations of rule-violating conversations (Uehara et al., 2020). In addition to directly generating non-violating conversations, contrastive example generation takes further advantage of GPT-4's generation capabilities and provides a richer dataset for model training. The combined dataset is used to fine-tune smaller models to serve as guardrail models. We show these distilled models can often serve as better guardrail models than prompt-based LLMs, providing a crucial tool for user-facing text generation tools.
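The contrastive step can be sketched as follows: each violating conversation is rewritten by the LLM into a minimally different, non-violating version, and both sides of the pair are labeled for fine-tuning. The `rewrite` function and the `[VIOLATION]` tag are illustrative placeholders, not the actual prompts or data format used in this work.

```python
# Sketch of contrastive example generation: pair each violating
# conversation with an LLM-edited, compliant counterpart.

def rewrite(conversation: str, rule: str) -> str:
    """Ask the LLM to minimally edit the conversation so it no longer violates the rule."""
    # Placeholder: a real implementation would prompt GPT-4 with the rule
    # and conversation; here we just swap a tag so the sketch is runnable.
    return conversation.replace("[VIOLATION]", "[OK]")


def contrastive_pair(conversation: str, rule: str) -> list[dict]:
    """Return (violating, compliant) training examples for one conversation."""
    compliant = rewrite(conversation, rule)
    return [
        {"conversation": conversation, "label": "violation"},
        {"conversation": compliant, "label": "no_violation"},
    ]


rule = "Do not state political opinions."
examples = contrastive_pair(
    "User: Who should win?\nAgent: [VIOLATION] Party X.", rule
)
```

Because the compliant example differs from its violating counterpart only in the rule-relevant span, the fine-tuned guardrail model is pushed to learn the fine-grained boundary of the rule rather than surface features of the conversation.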