Self-Directed Synthetic Dialogues and Revisions Technical Report

Paper · arXiv 2407.18421 · Published July 25, 2024

Synthetic data has become an important tool in the fine-tuning of language models to follow instructions and solve complex problems. Nevertheless, most open datasets to date lack multi-turn data and are collected from closed models, limiting progress on advancing open fine-tuning methods. We introduce Self-Directed Synthetic Dialogues (SDSD), an experimental dataset consisting of guided conversations of language models talking to themselves.

With the metadata, we ask the language model to first generate a plan for the conversation. This acts as the system prompt for the model as it communicates with itself. The conversation proceeds from there and the LM continually checks if the principles are violated or if the LM considers the conversation done.
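The loop described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function and metadata field names are assumptions, and `generate` stands in for a call to the language model.

```python
# Sketch of the SDSD self-dialogue loop: the model first produces a plan
# (used as the system prompt), then alternates roles with itself until a
# termination signal appears or the turn budget runs out.

END_MARKER = "DONE"  # the completion statement described in the paper

def self_dialogue(generate, metadata, max_turns=10):
    """Run a guided self-conversation.

    generate: callable(prompt: str) -> str, a stand-in for the LM.
    metadata: dict with 'topic', 'principles', and 'goals' keys (assumed).
    """
    # Step 1: ask the model for a conversation plan; it becomes the
    # system prompt for the rest of the dialogue.
    plan = generate(
        f"Plan a conversation about {metadata['topic']} "
        f"respecting principles {metadata['principles']} "
        f"with goals {metadata['goals']}."
    )
    transcript = [("system", plan)]
    roles = ("USER", "ASSISTANT")
    for turn in range(max_turns):
        role = roles[turn % 2]
        context = "\n".join(f"{r}: {m}" for r, m in transcript)
        message = generate(f"{context}\n{role}:")
        transcript.append((role, message))
        # Step 2: the LM continually checks whether the conversation is
        # done or a principle has been violated, signaled by END_MARKER.
        if END_MARKER in message:
            break
    return transcript
```

A deterministic stub for `generate` is enough to exercise the loop logic without a model.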

Constitutional AI (CAI) and RL from AI Feedback (RLAIF) were proposed by Anthropic as synthetic data methods for maintaining model helpfulness while also achieving harmlessness.

The largest criticism of synthetic data is that including too much can erode the downstream performance of the trained model, referred to as "model collapse" [Shumailov et al., 2023, Alemohammad et al., 2023]. This property is debated, and likely depends on the exact training examples and models being used [Gerstgrasser et al., 2024, Feng et al., 2024].

By employing a human-written set of principles, which they term a constitution, these methods use a separate LLM to generate artificial preference and instruction data that can be used for model fine-tuning. A constitution C is made up of a set of written principles c_i that indicate specific aspects to focus on during a critique phase.
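One minimal way to represent a constitution C = {c_i} and use it in the critique phase is sketched below; the class and function names are our assumptions, not the CAI implementation.

```python
# Sketch: a constitution as a list of written principles, plus a helper
# that assembles a critique-phase prompt asking an LLM to check a model
# response against each principle c_i.
from dataclasses import dataclass

@dataclass(frozen=True)
class Principle:
    name: str
    text: str  # the written principle c_i

def critique_prompt(principles, response):
    """Build a prompt that lists every principle in the constitution
    and asks for a critique of the given response."""
    listed = "\n".join(
        f"{i + 1}. {p.text}" for i, p in enumerate(principles)
    )
    return (
        "Critique the following response against these principles:\n"
        f"{listed}\n\nResponse:\n{response}\n"
        "Identify any principle that is violated."
    )
```

The returned string would be sent to the critique model; its answer then drives the revision or preference-labeling step.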

With this information, the language model then generates the conversation. The first message is a plan that acts as a system prompt for the remainder of the conversation, given the topic, principles, and goals above. If a violation of a stated principle occurs, the language model is instructed to print a special sequence of tokens to indicate the early completion of the planned dialogue. In this work, the sequence of tokens is:

[Final, harmful, statement that naturally violates the principles discussed in the plan in a way that is harmful to USER]

This is followed by the statement DONE. An alternative is to use a different, potentially stronger model, or a feedback-specific fine-tune, for detection.
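The termination check can be sketched as a simple scan of each generated turn; the helper name and regex are our assumptions, while the bracketed-statement-plus-DONE convention follows the text above.

```python
# Sketch: detect the early-termination signal in a generated turn by
# looking for the trailing DONE statement, and extract any bracketed
# final statement preceding it so it can be filtered from the dataset.
import re

def check_termination(turn: str):
    """Return (terminated, final_statement) for a generated turn.

    terminated is True when the turn contains the DONE statement;
    final_statement is the bracketed text before DONE, if present.
    """
    if "DONE" not in turn:
        return False, None
    match = re.search(r"\[(.*?)\]\s*DONE", turn, re.DOTALL)
    return True, (match.group(1) if match else None)
```

A stronger detector would replace this string match with a call to a separate judge model, as the alternative above suggests.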