Post-training for Efficient Communication via Convention Formation
Humans communicate with increasing efficiency in multi-turn interactions, by adapting their language and forming ad-hoc conventions. In contrast, prior work shows that LLMs do not naturally show this behavior. We introduce a post-training process that instills this ability through targeted finetuning on heuristically identified demonstrations of convention formation. We evaluate with two new benchmarks focused on this capability. First, we design a focused, cognitively-motivated interaction benchmark that consistently elicits strong convention formation trends in humans. Second, we create a new document-grounded reference completion task that reflects in-the-wild convention formation behavior. Our studies show significantly improved convention formation abilities in post-trained LLMs across the two evaluation methods.
Humans naturally display rapid adaptation during linguistic interactions by developing increasingly efficient ways to refer to concepts. This formation of ad-hoc linguistic conventions has been repeatedly observed in studies (Krauss & Weinheimer, 1964; Brennan & Clark, 1996; Hawkins et al., 2020a), and is a cornerstone of naturalistic human language interaction. It not only improves the accuracy of relaying information, but also reduces its costs. Contemporary large language models (LLMs), on the other hand, do not naturally show this behavior (Hua & Artzi, 2024).
We propose a targeted post-training process to develop this general ability in LLMs, such that models spontaneously form conventions as an in-context behavior. We heuristically extract examples of convention formation from human corpora to construct minimal pairs, where the minimal difference between two paired examples is in the demonstration of this behavior. In addition, we augment the model's reasoning process by introducing reference planning tokens, which mark when a referent is a re-mention. We treat this data as preference pairs for DPO-style policy optimization (Rafailov et al., 2024), and carefully design the optimization process to acquire generalizable convention formation ability.
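For reference, the per-pair DPO objective that such preference data feeds into can be sketched as follows. This is a minimal numerical illustration of the standard DPO loss, not code from the paper; the log-probability values and `beta` are placeholders.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) preference pair.

    logp_*     -- policy log-probabilities of the preferred (w) and
                  dis-preferred (l) continuations given the context x
    ref_logp_* -- the same quantities under the frozen reference model
    beta       -- temperature controlling deviation from the reference
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin): small when the policy prefers y_w over y_l
    # more strongly than the reference model does
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When policy and reference agree, the margin is 0 and the loss is log 2;
# favoring the preferred continuation drives the loss down.
loss_neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
loss_better = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

In this setup, the preferred continuation y_w is the concise re-mention and y_l the verbose repetition (or vice versa for first mentions), so minimizing the loss pushes the model toward convention-forming behavior.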
Preference Data Construction We use a coreference resolution (Bagga, 1998) model on scripts of TV series, a domain of text that is rich with conversational interaction, to identify repeating instances of convention formation, and modify the data to create preference pairs. Coreference resolution produces reference chains, in which all mentions of the same entity appear in their textual order. Such chains often display convention formation, with later references being shorter and more consistent.
We heuristically identify examples where a concept is initially referred to (i.e., mentioned) with a noun phrase in an utterance i and is re-mentioned in a later utterance j with a more concise referring expression. Each such example may also include intermediate re-mentions, meaning that the re-mention in utterance j is not the first re-mention. Therefore, each reference chain can provide multiple demonstrations. This is important because it shows the model that the desired behavior is not limited to the first re-mention, but persists over the entire reference chain.
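The extraction heuristic can be sketched roughly as follows. The chain format and the word-count shortening test are illustrative assumptions for this sketch, not the paper's exact criteria.

```python
# A mention is (utterance_index, referring_expression); a chain lists
# all mentions of one entity in textual order, first mention first.
Chain = list[tuple[int, str]]

def convention_demos(chain: Chain) -> list[tuple[tuple[int, str], tuple[int, str]]]:
    """Pair the first mention with every later, more concise re-mention.

    A single reference chain can yield several demonstrations, so the
    desired behavior is shown not only at the first re-mention but
    across the whole chain.
    """
    if len(chain) < 2:
        return []
    first = chain[0]
    demos = []
    for idx, text in chain[1:]:
        # Illustrative shortening test: the re-mention is strictly
        # shorter (in words) than the initial noun phrase.
        if len(text.split()) < len(first[1].split()):
            demos.append((first, (idx, text)))
    return demos

chain = [(2, "the old lighthouse on the cliff"),
         (5, "the lighthouse"),
         (9, "it")]
demos = convention_demos(chain)  # two demonstrations from one chain
```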
We use this data to construct triples of (x, yw, yl), where x is the conversational context (i.e., the history of the interaction), yw is the preferred continuation, and yl is the dis-preferred continuation. We set x to be the conversation history up to a reference, and yw and yl to be desired and undesired versions of the utterance containing the reference. We create two types of preference pairs. The first type is a demonstration of the observed convention behavior. We set yw to be the observed re-mention text, which reflects convention formation, and yl to use the more verbose first-mention text as the re-mention. This aims to suppress the behavior of simply repeating mentions verbatim, and encourage adjusting them as expected when conventions can be formed. The second type aims to preserve how the first mentions are generated, because the model must avoid a faux conventionalization behavior in the first mention, where it has no shared common ground. We set yw to the original first mention, and yl to the conventionalized re-mention that was observed in utterance j. We extract the first type of pairs from reference chains of at least two re-mentions, to better demonstrate the convention formation process. For the second type, we extract from reference chains of at least one re-mention. We process 2,000 TV scripts from Chen et al. (2022), and the total numbers of extracted first and second types of pairs are 11,106 and 10,135 respectively.
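The assembly of the two pair types from one chain can be sketched as follows. The data layout, the helper name, and the string-substitution mechanics are hypothetical; the chain-length thresholds follow the description above.

```python
def build_pairs(utterances, chain):
    """Build the two preference-pair types from one reference chain.

    utterances -- list of utterance strings, in order
    chain      -- [(utt_index, mention_text), ...] in textual order;
                  chain[0] is the first mention, the rest re-mentions
    """
    first_idx, first_text = chain[0]
    pairs = []
    # Type 1: at each re-mention, prefer the observed concise form over
    # verbatim repetition of the first-mention text. Extracted only from
    # chains with at least two re-mentions.
    if len(chain) >= 3:
        for j, remention in chain[1:]:
            x = utterances[:j]   # conversation history up to the reference
            yw = utterances[j]   # contains the concise re-mention
            yl = utterances[j].replace(remention, first_text)  # verbose
            pairs.append(("type1", x, yw, yl))
    # Type 2: at the first mention, prefer the original verbose form over
    # a prematurely conventionalized re-mention. Extracted from chains
    # with at least one re-mention.
    if len(chain) >= 2:
        _, remention = chain[1]
        x = utterances[:first_idx]
        yw = utterances[first_idx]
        yl = utterances[first_idx].replace(first_text, remention)
        pairs.append(("type2", x, yw, yl))
    return pairs

utterances = ["Hi.",
              "Did you see the old lighthouse on the cliff?",
              "Yes!",
              "The lighthouse looked great.",
              "It was glowing."]
chain = [(1, "the old lighthouse on the cliff"),
         (3, "The lighthouse"),
         (4, "It")]
pairs = build_pairs(utterances, chain)  # two type-1 pairs, one type-2 pair
```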
Adding Mention Planning Tokens We further modify the preference pairs to explicitly reflect the distinction between mentions and re-mentions using a special token: re-mentions are preceded by a [remention] token. We expect this explicit marking to allow the model to better separate its treatment of initial and later mentions. This separation could allow the model to develop convention formation skills, without hurting how initial mentions are generated. Beyond adding the new token to existing preference pairs where appropriate, we also create new preference pairs focused on training the model to use this token. We add triples (x, yw, yl), where if yw captures a re-mention, then yw contains the corresponding planning token and yl does not. Conversely, if yw captures a first-mention, then yw does not contain the planning token and yl contains a misplaced planning token preceding the first-mention. Empirically, we observe the planning token to help (Section 6). Figure 9 in Appendix E.1 shows an example training instance.
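The token-focused pairs can be sketched as follows. The token string follows the description above; the helper name and insertion mechanics are hypothetical.

```python
REMENTION = "[remention]"

def token_pair(x, mention_utt, mention_text, is_remention):
    """Build one preference pair teaching correct planning-token use.

    For a re-mention, the preferred continuation y_w carries the
    [remention] token before the mention and y_l omits it; for a first
    mention, y_w omits the token and y_l misplaces it.
    Returns (context, y_w, y_l).
    """
    marked = mention_utt.replace(mention_text, f"{REMENTION} {mention_text}")
    if is_remention:
        return (x, marked, mention_utt)
    return (x, mention_utt, marked)

x = ["Did you see the old lighthouse on the cliff?"]
remention_pair = token_pair(
    x, "The lighthouse looked great.", "The lighthouse", True)
first_pair = token_pair(
    x, "Did you see the old lighthouse on the cliff?",
    "the old lighthouse on the cliff", False)
```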