DiaSynth: Synthetic Dialogue Generation Framework for Low Resource Dialogue Applications

Paper · arXiv 2409.19020 · Published September 25, 2024
Tags: Synthetic Dialog, Domain Specialization, Discourses

The scarcity of domain-specific dialogue datasets limits the development of dialogue systems across applications. Existing research is constrained by general or niche datasets that lack sufficient scale for training dialogue systems. To address this gap, we introduce DiaSynth, a synthetic dialogue generation framework capable of generating high-quality, contextually rich dialogues across a wide range of domains. Unlike existing frameworks, DiaSynth uses Large Language Models (LLMs) and Chain of Thought (CoT) reasoning to generate dynamic, domain-specific dialogues with simulated personas and diverse conversational features. We perform our experiments by generating synthetic data using different LLMs and few-shot examples from DialogSum and SAMSum. The pretrained language models fine-tuned on the synthetic data outperform the base models by 16.47% on dialogue summarization, while the comparison between models fine-tuned on in-domain data and synthetic data shows that the synthetic data is able to capture 90.48% of the performance distribution of the in-domain data on dialogue summarization.

Dialogue systems are central to applications like customer service chatbots and virtual assistants. Their effectiveness depends on large, high-quality, domain-specific datasets. The lack of large-scale, high-quality datasets across domains like academic discussions, healthcare, and everyday conversations poses a challenge. This scarcity limits the development of dialogue systems that generalize well across domains.

In recent years, there has been a significant increase in research focused on synthetic dialogue generation, largely driven by advancements in Large Language Models (LLMs). To generate realistic and diverse synthetic data, researchers have incorporated personalities, profiles, and character information when prompting LLMs to generate dialogues Han et al. [2024]. By enhancing dialogue realism through the simulation of various personality profiles, utilizing the Big Five personality model, and employing structured prompts, this approach has improved task performance in models fine-tuned on these generated dialogues compared to those trained on general chit-chat datasets.

Moreover, integrating personas into synthetic data generation prompts Chan et al. [2024] has demonstrated that models fine-tuned on personalized synthetic data outperform some LLMs of much larger scale. The inclusion of personas in prompts diversifies the difficulty levels and ranges within the synthetic data, enabling the models to learn from a broader spread of conversational scenarios.

Prompt-based techniques have also emerged as powerful methods for generating high-quality synthetic dialogues, particularly for task-oriented dialogue systems. Steindl et al. [2023] explore the generation of synthetic dialogues from structured prompts, focusing on enhancing task-oriented dialogue systems. Their work demonstrates that prompt engineering can produce dialogues that are contextually appropriate and improve system performance by aligning synthetic data more closely with real-world requirements.

To achieve a higher quantity, diversity, and creativity in human-written instruction data, Wang et al. [2022] propose inputting prompts to LLMs to generate instructions based on a small set of seed human-written instructions. This approach aligns the expanded training data more closely with desired task objectives and allows for iterative improvements, producing more nuanced and effective dialogues that meet specific task demands.
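The seed-based expansion described above can be sketched as a simple loop: sample a few instructions from the current pool as in-context examples, prompt an LLM for a new instruction, filter, and add it back to the pool. The sketch below stubs out the LLM call; the function names and prompt wording are illustrative assumptions, not the exact Self-Instruct pipeline.

```python
import random

def llm(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM API here.
    return f"Generated instruction based on: {prompt[:40]}..."

def expand_instructions(seed_pool, rounds=3, samples_per_prompt=2):
    """Iteratively grow an instruction pool from human-written seeds."""
    pool = list(seed_pool)
    for _ in range(rounds):
        # Sample existing instructions as in-context examples.
        examples = random.sample(pool, min(samples_per_prompt, len(pool)))
        prompt = "Write a new instruction similar to:\n" + "\n".join(examples)
        candidate = llm(prompt)
        # A real pipeline would also filter near-duplicates before adding.
        if candidate not in pool:
            pool.append(candidate)
    return pool

seeds = ["Summarize the meeting notes.", "Translate the sentence to French."]
expanded = expand_instructions(seeds, rounds=3)
```

Each round conditions the model on a different sample from the growing pool, which is what lets the expanded data drift toward more varied and nuanced instructions.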

Similarly, our study expands topics into subtopics, ensuring that the generated dialogues provide more in-depth and high-quality conversations. By doing so, we aim to produce synthetic data that not only covers a broader range of scenarios but also delves deeper into each topic, thereby enhancing the overall effectiveness of the dialogue systems trained on this data.

Users can optionally provide few-shot examples of the format in which they want the dialogues to be generated. Directly generating dialogues from user-supplied topics would be too superficial, because such topics lack specificity. To overcome this, we generate m subtopics for each of the n topics given by the user. Dialogues generated from subtopics gain specificity but still lack variety, since every dialogue is implicitly shaped by the personas of the participants and by other characteristics such as location and emotion. To enhance variety and depth, we generate p personas per subtopic and create dialogues for all persona-subtopic combinations. To further ground the dialogues in varied settings and characteristics, we employ CoT reasoning during generation: DiaSynth uses CoT to reason about the settings and characteristics of a dialogue, which are listed in Appendix C, ensuring that the dialogues are contextually rich and realistic.
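The hierarchical pipeline above (n topics, m subtopics each, p personas per subtopic, one dialogue per persona-subtopic pair) can be sketched as nested expansion steps. The `llm` helper below is a stub standing in for real LLM calls, and the prompt strings are illustrative assumptions rather than the paper's exact prompts.

```python
def llm(prompt: str) -> str:
    # Placeholder for an actual LLM call.
    return f"[output for: {prompt}]"

def generate_dialogues(topics, m=2, p=2):
    dialogues = []
    for topic in topics:
        # Step 1: expand each broad topic into m specific subtopics.
        subtopics = [llm(f"Subtopic {i+1} of '{topic}'") for i in range(m)]
        for sub in subtopics:
            # Step 2: create p personas to diversify each subtopic.
            personas = [llm(f"Persona {j+1} for '{sub}'") for j in range(p)]
            for persona in personas:
                # Step 3: CoT reasoning about the dialogue's setting and
                # characteristics, then dialogue generation conditioned on it.
                cot = llm(f"Reason about the setting of '{sub}' for {persona}")
                dialogues.append(llm(f"Dialogue on '{sub}' as {persona}, given {cot}"))
    return dialogues

dialogs = generate_dialogues(["healthcare", "academic discussions"], m=3, p=2)
# n=2 topics, m=3 subtopics, p=2 personas yields 2 * 3 * 2 = 12 dialogues.
```

The key design point is the multiplicative fan-out: n × m × p dialogue specifications from only n user-provided topics, with the CoT step injecting per-dialogue context before generation.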

Characteristic: Description

- Age and Gender: Defines demographic details, influencing style and tone.
- Familiarity Level: Affects formality and depth based on the relationship between speakers.
- Emotional States: Impacts tone and flow based on emotions (e.g., happy, sad).
- Formality Level: Determines level of politeness or casualness.
- Duration of the Conversation: Suggests the intended length and complexity of the dialogue.
- Communication Medium: Defines the medium (e.g., face-to-face, phone), influencing style.
- Topic of the Conversation: Guides the content and direction of the dialogue.
- Location of the Conversation: Adds context influencing formality and content.
- Agreement or Disagreement: Drives dialogue dynamics based on agreement level.
- Natural Dialogue Features: Adds authenticity with fillers, pauses, and slang.

Table 11: Characteristics of the Dialogue for CoT Prompt
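One way the Table 11 characteristics could be folded into a CoT prompt is to sample one value per characteristic and prepend them to the generation instruction. The value lists and prompt wording below are assumptions for illustration, not the paper's actual prompt; the topic is supplied separately by the subtopic, so it is not sampled here.

```python
import random

# Hypothetical example values for each characteristic in Table 11.
CHARACTERISTICS = {
    "Age and Gender": ["two women in their 30s", "a teenager and his grandfather"],
    "Familiarity Level": ["close friends", "strangers"],
    "Emotional States": ["happy", "anxious"],
    "Formality Level": ["casual", "formal"],
    "Duration of the Conversation": ["brief", "extended"],
    "Communication Medium": ["face-to-face", "phone"],
    "Location of the Conversation": ["a cafe", "an office"],
    "Agreement or Disagreement": ["mostly agree", "mildly disagree"],
    "Natural Dialogue Features": ["fillers and pauses", "occasional slang"],
}

def build_cot_prompt(subtopic: str, persona: str, rng=random) -> str:
    """Sample one value per characteristic and ask the model to reason
    about the setting before writing the dialogue."""
    sampled = {name: rng.choice(opts) for name, opts in CHARACTERISTICS.items()}
    lines = [f"- {name}: {value}" for name, value in sampled.items()]
    return (
        f"Topic: {subtopic}\nPersona: {persona}\n"
        "First, reason step by step about how these characteristics shape "
        "the conversation, then write the dialogue.\n" + "\n".join(lines)
    )

prompt = build_cot_prompt("telehealth appointments", "a busy nurse")
```

Sampling a fresh combination per dialogue is what spreads the generated data across formality levels, emotional states, and settings instead of collapsing onto one register.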