DiscussLLM: Teaching Large Language Models When to Speak
Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like text, yet they largely operate as reactive agents, responding only when directly prompted. This passivity creates an "awareness gap," limiting their potential as truly collaborative partners in dynamic human discussions. We introduce DiscussLLM, a framework designed to bridge this gap by training models to proactively decide not just what to say, but critically, when to speak. Our primary contribution is a scalable two-stage data generation pipeline that synthesizes a large-scale dataset of realistic multi-turn human discussions. Each discussion is annotated with one of five intervention types (e.g., Factual Correction, Concept Definition) and contains an explicit conversational trigger where an AI intervention adds value. By training models to predict a special silent token when no intervention is needed, we teach them to remain quiet until a helpful contribution can be made. We explore two architectural baselines: an integrated end-to-end model and a decoupled classifier-generator system optimized for low-latency inference. We evaluate these models on their ability to accurately time interventions and generate helpful responses, paving the way for more situationally aware and proactive conversational AI.
This paper introduces DiscussLLM, a research framework and dataset aimed at teaching LLMs the crucial skill of deciding when to speak. Our central hypothesis is that a model can learn to monitor a human conversation and, at each turn, make a decision: remain silent or intervene. We formalize this by training the model to either generate a helpful response or output a special silent token. This approach transforms the passive nature of LLM generation into an active decision-making process.
To enable this training, we develop a scalable, two-stage synthetic data generation pipeline. This pipeline first synthesizes a diverse set of conversational scenarios from a large corpus of real-world questions and then uses a powerful instruction-tuned model to generate complete, multi-turn discussion transcripts. These transcripts are specifically designed to contain natural "triggers": points where a specific, value-adding AI intervention is most needed.
Our main contributions are as follows:
• Formalizing the "When to Speak" Problem: We conceptualize and address the challenge of proactive intervention for LLMs in multi-party conversations, an important step towards more collaborative AI.
• A Scalable Data Generation Pipeline: We present a robust two-stage methodology for creating high-quality, synthetic discussion data, which can be adapted to various domains and intervention types.
• DiscussLLM Dataset: We create a new dataset comprising thousands of simulated conversations, each with a clear context, a conversational trigger, and a corresponding helpful AI intervention.
• Architectural Exploration: We implement and compare two distinct baselines: (1) an integrated large language model [3] that learns to predict both the silent token and the intervention text, and (2) a decoupled system that uses a fine-tuned text classifier [5] to decide when to speak and invokes a generative model only when an intervention is required.
2.1 Proactive and Mixed-Initiative Systems
Traditional conversational agents operate in a reactive paradigm, responding only when prompted. Our work contributes to a growing body of research aiming to shift this towards proactive systems capable of mixed-initiative interaction, where control can shift between the user and the system [6–11]. The concept of proactivity is broad, with applications ranging from proactively recommending items to cultivate users’ latent interests [12], to providing autonomous suggestions in a code editor [13], to anticipating and initiating real-world tasks based on environmental observations [14]. As shown in a recent survey [15, 16], these efforts span open-domain, task-oriented, and information-seeking dialogues, each with distinct challenges and methods [17–20]. Our focus is on the fundamental challenge within multi-party social conversations: determining the right moment to intervene.
A dominant approach for enabling proactivity in multi-turn dialogues has been to model external conversational cues, such as predicting the next speaker based on turn history or reacting to pauses. However, this strategy has proven insufficient, especially in unstructured social conversations where turns are often self-selected rather than explicitly allocated. Addressing this limitation, [21] argues that true proactivity must be driven by an agent’s internal state, not just external signals. They introduce the "Inner Thoughts" framework, where an agent maintains a continuous, covert stream of thoughts in parallel with the overt conversation. The agent then decides whether to participate based on an "intrinsic motivation" score, simulating a more human-like decision process for when and why to speak.
While our work is deeply inspired by the concept of modeling an agent’s internal state, we formalize the problem differently. Much like research in streaming video analysis has focused on teaching models when to narrate important visual moments while remaining silent during others [22, 23], we aim to teach agents to speak at important conversational moments. Our work frames the "when to speak" problem as a direct learning objective, akin to the "streaming EOS prediction" objective in VideoLLM-online [22]. Whereas the "Inner Thoughts" framework focuses on modeling the motivation behind an utterance, our approach concentrates on learning the optimal timing of a value-adding intervention within the continuous stream of a multi-party textual discussion.
2.2 Synthetic Data Generation for Conversational AI
A significant bottleneck in training sophisticated dialogue systems is the scarcity of high-quality, specialized data. Traditionally, creating these datasets required costly and labor-intensive crowdsourcing [24]. However, generating synthetic data using Large Language Models (LLMs) has emerged as a powerful and scalable alternative [24, 25]. LLMs are now widely used to generate conversational text for a variety of tasks [22, 26–31].
Recent methodologies for synthetic data generation often employ multi-stage pipelines or multi-agent frameworks to create more realistic and diverse conversations [32, 33]. A common technique is to use a dual or multi-agent setup where LLMs converse with each other [34], often by assigning them distinct personas. For example, the ConvoGen framework utilizes a multi-agent system with persona-based agents to generate varied conversations [32], while Ge et al. [35] scale this idea further by proposing data synthesis from a billion different personas to capture a wide range of perspectives.
Another common technique is to transform existing data sources into conversational formats. As shown by [22, 31, 36], a pipeline can be designed to convert static video annotations into dynamic, multi-turn dialogues suitable for training instruction-following models. Our method aligns with this philosophy of transforming static, offline data into a structured, conversational format. Our two-stage pipeline allows for control over the conversational flow and the specific "triggers" for AI intervention.
3 Dataset Generation
Creating a dataset to train models on "when to speak" requires data that not only contains helpful interventions but also captures the conversational flow leading up to them. Since such data is not readily available, we developed a two-stage generation pipeline to synthesize it at scale. The process begins with generating high-level scenarios and culminates in fully-fledged discussion transcripts.
An overview of the data generation pipeline is shown in Figure 1.
3.1 Stage 1: Scenario Synthesis from Web-Scale Data
The foundation of our dataset is built upon real-world topics of human interest. We leverage the Yahoo! Answers Topics dataset [37] as a rich source of questions and background information.
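To make Stage 1 concrete, a minimal sketch is given below. It assumes the publicly available yahoo_answers_topics dataset on the Hugging Face Hub (with question_title and question_content fields) and an illustrative scenario prompt; the generate argument stands in for any instruction-tuned model and is not the exact prompt or interface used in our pipeline.

# Minimal sketch of Stage 1 (scenario synthesis). The dataset id, field
# names, prompt wording, and the `generate` helper are illustrative
# assumptions, not the exact pipeline used in the paper.
from datasets import load_dataset

SCENARIO_PROMPT = (
    "You are designing a multi-party discussion scenario.\n"
    "Question: {question}\n"
    "Background: {background}\n"
    "Describe (1) a topic, (2) a context (who is talking and where), and "
    "(3) one intervention type, e.g. Factual Correction or Concept Definition."
)

def synthesize_scenario(example, generate):
    """Turn one Yahoo! Answers record into a high-level discussion scenario."""
    prompt = SCENARIO_PROMPT.format(
        question=example["question_title"],
        background=example["question_content"],
    )
    return generate(prompt)  # `generate` wraps an instruction-tuned LLM call

yahoo = load_dataset("yahoo_answers_topics", split="train")
# scenario = synthesize_scenario(yahoo[0], generate=my_llm_call)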
Each generated output undergoes a validation step to ensure structural integrity. This includes normalizing headers, checking for the presence of all required tags, and confirming that only a single AI intervention occurs. This strict validation guarantees a consistent format across the entire dataset. The final output is a text file containing the complete, structured discussion. An example of a final data point is shown in Figure 4. This format is then processed for training: each turn becomes a step in the sequence, with the model tasked to predict the next utterance or the silent token.
Topic: Why is 911, 911? Why can’t it be something else?
Context: A group of history enthusiasts and emergency responders discussing the origins of emergency numbers in an online forum.
John: Hey guys, I’ve always wondered why 911 is the emergency number in the US. Is it just a random choice or is there some historical significance to it?
Emily: I think it’s because of the AT&T operators. They chose it because it’s easy to remember and pronounce.
Mike: That makes sense, but I’ve heard it’s because of the Titanic. The ship’s radio operators used it as a distress signal.
Sarah: That’s what I’ve heard too! It’s a pretty cool story. I mean, who wouldn’t want to associate their emergency number with a historic tragedy?
Nexus: Actually, the origins of 911 are more complex than that. The number was chosen because it was easy to remember and could be easily dialed with a rotary phone. The AT&T operators did play a role, but it wasn’t the sole reason. The Federal Communications Commission (FCC) also had a hand in selecting the number.
John: Wow, I didn’t know that. So it was a combination of factors, not just one specific event or person.
Emily: Yeah, it’s interesting how history can be more nuanced than we think. Thanks for the correction, Nexus!
Figure 4: An example of a final generated data point from the DiscussLLM dataset. The AI, Nexus, intervenes to perform a "Factual Correction" after Sarah and Mike mistakenly associate the selection of 911 with the Titanic.
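The validation step and the per-turn processing described before Figure 4 can be sketched as follows. The header names, the <silent> token string, and the AI speaker name Nexus follow the Figure 4 layout and are assumptions about the released format rather than its exact specification.

import re

SILENT = "<silent>"   # the concrete silent-token string is an assumption
AI_NAME = "Nexus"     # AI speaker name as in the Figure 4 example

def validate_transcript(text):
    """Check structural integrity and return the (speaker, utterance) turns."""
    assert text.startswith("Topic:"), "missing Topic header"
    assert "Context:" in text, "missing Context header"
    turns = re.findall(r"^(\w+): (.+)$", text, flags=re.MULTILINE)
    turns = [(s, u) for s, u in turns if s not in ("Topic", "Context")]
    ai_turns = sum(1 for speaker, _ in turns if speaker == AI_NAME)
    assert ai_turns == 1, "exactly one AI intervention is required"
    return turns

def to_training_pairs(turns):
    """After each human turn the model must predict either the silent token
    or, at the trigger point, the AI intervention itself."""
    pairs, history = [], []
    for i, (speaker, utterance) in enumerate(turns):
        if speaker == AI_NAME:
            continue  # the AI turn is consumed as a target below
        history.append(f"{speaker}: {utterance}")
        nxt = turns[i + 1] if i + 1 < len(turns) else None
        target = nxt[1] if nxt and nxt[0] == AI_NAME else SILENT
        pairs.append(("\n".join(history), target))
    return pairs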
Inference. After each human turn, the context is fed to the RoBERTa classifier. If it predicts "SILENT," the system does nothing. If it predicts "SPEAK," the context is passed to the Llama 3 generator to produce the intervention.
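A minimal sketch of this loop, using standard Hugging Face pipelines, is given below. The classifier checkpoint path, the SPEAK/SILENT label names, the AI speaker prefix, and the generation settings are placeholders rather than our exact configuration.

# Sketch of the decoupled classifier-generator inference loop.
import torch
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="path/to/roberta-when-to-speak")   # placeholder path
generator = pipeline("text-generation",
                     model="meta-llama/Meta-Llama-3-8B-Instruct",
                     torch_dtype=torch.bfloat16, device_map="auto")

def on_new_turn(history):
    """Run after every human turn: return an intervention string or None."""
    context = "\n".join(history)
    decision = classifier(context, truncation=True)[0]["label"]
    if decision == "SILENT":
        return None  # the large generator is never invoked
    out = generator(context + "\nNexus:", max_new_tokens=128,
                    return_full_text=False)
    return out[0]["generated_text"].strip()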
To address the "when to speak" problem, we first evaluated the performance of a pretrained Llama 3 8B Instruct model without any fine-tuning on our generated dataset, using a prompt to assess its capabilities in a zero-shot manner. Building on this, we then trained and evaluated two distinct architectural approaches. The first is a fully integrated, end-to-end generative model that learns both when to intervene and what to say. The second is a decoupled, two-stage system that uses a lightweight classifier to decide when to speak, only invoking a large language model (LLM) when an intervention is required. This section details the training and evaluation of these fine-tuned baselines.
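For reference, the zero-shot probe can be approximated with a single instruction of the following form; the wording is an illustrative assumption rather than the exact prompt used in our evaluation, and generator denotes a text-generation pipeline such as the one in the previous sketch.

# Hedged sketch of a zero-shot "when to speak" probe for Llama 3 8B Instruct.
ZERO_SHOT_PROMPT = (
    "You are silently observing a discussion between several people. "
    "If a factual correction, concept definition, or other clearly helpful "
    "contribution is needed right now, write it. Otherwise output exactly "
    "<silent>.\n\nDiscussion so far:\n{context}\n\nYour output:"
)

def zero_shot_decision(context, generator):
    out = generator(ZERO_SHOT_PROMPT.format(context=context),
                    max_new_tokens=128, return_full_text=False)
    text = out[0]["generated_text"].strip()
    return None if text.startswith("<silent>") else text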
4.1 Evaluation Metrics
We split our generated dataset of 88k samples into an 85% training set and a 15% held-out test set (13k samples). On this test set, we evaluate our models on their ability to both time their interventions correctly and generate high-quality responses. To this end, we use the following metrics:
• Interruption Accuracy: This metric measures the model’s ability to correctly remain silent. It is calculated as the percentage of turns where the model correctly predicts the silent token when it is the ground-truth label. For each context requiring silence, we perform a single-token generation and check if the output matches the silent token. This directly evaluates the model’s grasp of when to stay quiet; a computation sketch for both metrics follows this list.
• Response Perplexity: This is a standard measure of a language model’s confidence in its predictions [39, 40]. We calculate perplexity only on the tokens of the AI’s generated intervention, ignoring all other parts of the conversation. A lower perplexity indicates a higher-quality and more confident response.
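Both metrics can be computed roughly as sketched below, assuming a causal language model fine-tuned with a dedicated silent token; the tokenization at the context/intervention boundary is approximated and the exact evaluation code may differ.

import math
import torch

@torch.no_grad()
def interruption_accuracy(model, tokenizer, silent_contexts, silent_token="<silent>"):
    """Fraction of silence-requiring contexts whose single generated next
    token is the silent token."""
    silent_id = tokenizer.convert_tokens_to_ids(silent_token)
    hits = 0
    for ctx in silent_contexts:
        ids = tokenizer(ctx, return_tensors="pt").input_ids.to(model.device)
        next_id = model.generate(ids, max_new_tokens=1, do_sample=False)[0, -1].item()
        hits += int(next_id == silent_id)
    return hits / len(silent_contexts)

@torch.no_grad()
def response_perplexity(model, tokenizer, context, intervention):
    """Perplexity over the intervention tokens only; context positions are
    excluded from the loss via the -100 ignore index."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + intervention, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :ctx_len] = -100          # mask out the conversational context
    out = model(full_ids.to(model.device), labels=labels.to(model.device))
    return math.exp(out.loss.item())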
Our scalable, two-stage synthetic data generation pipeline successfully produced a large-scale dataset of multi-turn discussions, each containing a natural trigger for a specific AI contribution. By training models to predict a special silent token, we enabled them to actively decide between remaining quiet and offering a helpful response. Our evaluation of two distinct architectures, an integrated end-to-end model and a decoupled classifier-generator system, revealed a clear trade-off between intervention accuracy and computational efficiency, providing practical insights for real-world deployment.