Collaborative Rational Speech Act: Pragmatic Reasoning for Multi-Turn Dialog

Paper · arXiv 2507.14063 · Published July 18, 2025
Philosophy · Subjectivity · Linguistics · NLP, NLU · Natural Language Inference

In this paper, we introduce the Collaborative Rational Speech Act (CRSA) framework, an information-theoretic (IT) extension of RSA that models multi-turn dialog by optimizing a gain function adapted from rate-distortion theory. This gain extends the one maximized in the original RSA model, accounting for the scenario in which both agents in a conversation hold private information and produce utterances conditioned on the dialog so far. We demonstrate the effectiveness of CRSA on referential games and template-based doctor–patient dialogs in the medical domain. Empirical results show that CRSA yields more consistent, interpretable, and collaborative behavior than existing baselines, paving the way for more pragmatic and socially aware language agents.

Agents must track shared tasks to communicate meaningfully and contextually.

The Rational Speech Act (RSA) framework (Frank and Goodman, 2012) offers a principled foundation for modeling pragmatic reasoning as recursive social inference between speakers and listeners. Viewed through an information-theoretic (IT) lens, RSA approximates a Rate-Distortion solution (Cover and Thomas, 2001), where the listener reconstructs intended meaning from observed utterances (Zaslavsky et al., 2021). RSA has successfully captured phenomena such as reference (Degen et al., 2020), implicature (Bergen et al., 2016), and vagueness (Herbstritt and Franke, 2019), and powered applications from grounded captioning (Cohn-Gordon et al., 2018) to controlled generation (Wang and Demberg, 2024). Yet, despite this promise, existing RSA extensions remain limited in multi-turn, task-oriented dialog: they struggle to model evolving beliefs or integrate dialog history (Carenini et al., 2024; Degen, 2023). We argue this shortfall stems from the absence of a unified, theoretically grounded mechanism for belief and task tracking in collaborative interaction.
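The RSA recursion itself is compact enough to sketch directly. Below is a minimal, illustrative implementation in a toy referential game; the lexicon, prior, and rationality parameter are invented for the example, not taken from the paper:

```python
import numpy as np

# Toy referential game: 3 objects (meanings), 3 utterances.
# lexicon[u, m] = 1 if utterance u is literally true of meaning m.
lexicon = np.array([
    [1, 1, 0],   # "blue"   is true of objects 0 and 1
    [0, 1, 1],   # "circle" is true of objects 1 and 2
    [0, 0, 1],   # "square" is true of object 2 only
], dtype=float)

prior = np.ones(3) / 3   # uniform prior over meanings
alpha = 1.0              # speaker rationality

def normalize(x, axis):
    return x / x.sum(axis=axis, keepdims=True)

# Literal listener: L0(m | u) ∝ lexicon(u, m) * P(m)
L0 = normalize(lexicon * prior, axis=1)

# Pragmatic speaker: S1(u | m) ∝ exp(alpha * log L0(m | u))
S1 = normalize(np.exp(alpha * np.log(L0 + 1e-12)).T, axis=1)

# Pragmatic listener: L1(m | u) ∝ S1(u | m) * P(m)
L1 = normalize(S1.T * prior, axis=1)
```

Hearing the ambiguous "blue", the pragmatic listener L1 assigns probability 2/3 to object 0 and 1/3 to object 1: a speaker referring to object 1 could equally have said "circle", so "blue" implicates object 0.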

In a medical consultation, for instance, the patient shares symptoms and background, while the physician asks questions, proposes diagnoses, and recommends treatments.

One of the major limitations of the model is that there is no systematic way of directly modeling the meaning spaces M_A and M_B, which are always application-dependent.

In addition, there is the problem of modeling the space of utterances, which CRSA inherits from classic RSA. However, since past utterances are part of the design of CRSA, the natural way to scale the model to more realistic applications, in which generation proceeds token by token, is to replace utterances with tokens directly. We expect this shift to affect the model's pragmatic capabilities, since reasoning would then be performed at the token level rather than the utterance level. We intend to investigate these trade-offs carefully in future work.
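To make the utterance-vs-token distinction concrete, here is a rough incremental-RSA-style sketch of pragmatic next-token scoring in the spirit of Cohn-Gordon et al. (2018); the `base_probs` interface and all numbers are hypothetical, and this is not CRSA's actual formulation:

```python
import numpy as np

def pragmatic_next_token(base_probs, target=0, alpha=1.0):
    """Reweight a base speaker's next-token distribution by how well each
    candidate token identifies the target for a literal listener.

    base_probs[y, v]: P(next token v | history, target y) for each candidate
    target y.  Hypothetical interface; a real system would obtain these
    probabilities from a language model conditioned on each target.
    """
    # Literal listener over targets for each token: L0(y | v) ∝ P(v | y) * P(y)
    L0 = base_probs / base_probs.sum(axis=0, keepdims=True)
    # Pragmatic speaker: S1(v | target) ∝ P(v | target) * L0(target | v)^alpha
    s1 = base_probs[target] * L0[target] ** alpha
    return s1 / s1.sum()
```

With a base speaker that is indifferent between two tokens, the reweighting shifts probability mass toward the token that better discriminates the intended target from the alternatives.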

8 Summary and Concluding Remarks

In this work, we introduced the Collaborative Rational Speech Act (CRSA) framework, an information-theoretic extension of RSA tailored for principled pragmatic reasoning in multi-turn, task-oriented dialogues. By integrating a novel multi-turn gain function grounded in interactive rate-distortion theory, CRSA effectively models the evolving belief dynamics of both interlocutors, overcoming key limitations of traditional RSA in collaborative contexts. Our preliminary results demonstrate that CRSA successfully captures the progression of shared understanding, partner beliefs, and utterance generation, paving the way for more natural and efficient communication in complex conversational settings.

CRSA lays the foundation for developing conversational agents driven by mathematically grounded principles of pragmatic reasoning.

As everyday use cases of large language model (LLM) AI assistants have expanded, it is becoming increasingly important to personalize responses to align with different users' preferences and goals. While reinforcement learning from human feedback (RLHF) is effective at improving LLMs to be generally more helpful and fluent, it does not account for variability across users, as it models the entire user population with a single reward model. We present a novel framework, Preference Learning Using Summarization (PLUS), that learns text-based summaries of each user's preferences, characteristics, and past conversations. These summaries condition the reward model, enabling it to make personalized predictions about the types of responses valued by each user. We train the user-summarization model with reinforcement learning and update the reward model simultaneously, creating an online co-adaptation loop. We show that, in contrast with prior personalized RLHF techniques or with in-context learning of user information, summaries produced by PLUS capture meaningful aspects of a user's preferences. Across different pluralistic user datasets, we show that our method is robust to new users and diverse conversation topics. Additionally, we demonstrate that the textual summaries generated about users can be transferred for zero-shot personalization of stronger, proprietary models like GPT-4.

PLUS learns text-based user summaries that act as a latent user variable conditioning the reward model. While it might be possible to simply prompt an LLM reward model with an automatically generated user summary, our experiments reveal that such automatic summaries tend to focus on the topic of the conversation and omit the key details the reward model needs to accurately determine the user's unique preferences that can guide future conversations.

A key challenge is updating the summarizer and the reward model simultaneously, so that the summary can be optimized against the reward model's prediction accuracy while the reward model improves using the generated summary of the user's preferences.
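This co-adaptation loop can be sketched schematically. Everything below is a toy stand-in, assuming a bilinear reward model and a categorical policy over candidate summaries; none of the classes, shapes, or update rules come from the PLUS paper:

```python
import numpy as np

class RewardModel:
    """Toy personalized reward model: a bilinear interaction between
    summary features and response features."""
    def __init__(self, dim=2):
        self.W = np.zeros((dim, dim))

    def score(self, summary, response):
        return summary @ self.W @ response

    def fit_pairwise(self, summary, chosen, rejected, lr=0.1):
        # One Bradley-Terry gradient step: raise score(chosen) over score(rejected).
        p = 1.0 / (1.0 + np.exp(self.score(summary, rejected)
                                - self.score(summary, chosen)))
        self.W += lr * (1.0 - p) * np.outer(summary, chosen - rejected)

class Summarizer:
    """Toy summarizer: a categorical policy over candidate summary vectors,
    trained with REINFORCE using the reward model's accuracy as reward."""
    def __init__(self, n_candidates=2):
        self.logits = np.zeros(n_candidates)

    def policy(self):
        e = np.exp(self.logits - self.logits.max())
        return e / e.sum()

    def reinforce(self, action, reward, lr=0.5):
        # Score-function gradient for a categorical policy.
        grad = -self.policy()
        grad[action] += 1.0
        self.logits += lr * reward * grad

def co_adaptation_step(summarizer, reward_model, candidates, pair, rng):
    """One joint update: sample a summary, reward the summarizer with the
    reward model's prediction accuracy on a preference pair, then update
    the reward model conditioned on that summary."""
    chosen, rejected = pair
    action = rng.choice(len(candidates), p=summarizer.policy())
    summary = candidates[action]
    correct = reward_model.score(summary, chosen) > reward_model.score(summary, rejected)
    summarizer.reinforce(action, float(correct))
    reward_model.fit_pairwise(summary, chosen, rejected)
```

The design point the sketch illustrates is the circular dependency: the summarizer's reward signal only exists once the reward model can score pairs, and the reward model's inputs only exist once the summarizer produces summaries, which is why the two must be trained in an interleaved online loop.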

In this section, we provide preliminary evidence that the CRSA model can produce reasonably good estimates of both the likelihood of each utterance u_t and the task target y in doctor–patient conversations, namely the disease corresponding to the symptoms described by the patient.