Can discourse-level structure and conversational-level organization work together?
This explores whether two different layers of language organization — discourse-level structure (how a single text or turn is internally built: what points backward, what points forward, how arguments are framed) and conversational-level organization (how turns connect across a dialogue: topic tracking, common ground, repair) — reinforce each other or operate independently in LLMs.
The underlying question is whether the way a model organizes a single piece of text and the way it manages a whole conversation are the same problem viewed at two scales. The corpus suggests the two are deeply linked but currently disconnected in practice. At the discourse level, Does ChatGPT organize text differently than human writers? finds that ChatGPT defaults to summarizing what was already said (anaphoric), while student writers more often point forward to set up arguments to come (cataphoric), and it suggests this may stem from autoregressive, token-by-token generation. If that explanation holds, the backward-looking habit isn't just a stylistic quirk; it's a structural disposition that would naturally bleed into how a model handles a conversation.
And it does. At the conversational level, Can LLMs truly update shared conversational common ground? shows LLMs treat the opening prompt as a fixed frame and interpret every later turn inside it, never jointly revising the shared assumptions. That's the same backward-anchoring failure, scaled up: a model that organizes text by referring back rather than projecting forward will also struggle to let a conversation's common ground move. The discourse-level finding and the conversation-level finding are two readings of one underlying limitation.
The encouraging part is that the corpus shows the two layers genuinely working together when the architecture is built for it. Can conversation structure predict dialogue success better than content? (TRACE) finds that structural features of a dialogue alone predict success with 68% accuracy, close to the 70% of a content-based baseline, while a hybrid of structure plus content reaches 80%. Structure and substance aren't redundant; they're complementary channels, and combining the 'how' with the 'what' beats either alone. Similarly, Can dialogue systems track both speakers' beliefs across turns? (CRSA) supplies the missing forward-projecting machinery: it tracks both speakers' beliefs across turns, modeling the progression from partial to shared understanding, which is exactly the cataphoric, anticipatory move that token-level systems lack.
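To make the belief-tracking idea concrete, here is a minimal sketch of the vanilla Rational Speech Acts (RSA) recursion that CRSA is described as building on. It is not CRSA itself: there is no rate-distortion term and no multi-turn state, and the toy referential game (the `OBJECTS` list, the `LEXICON` table, and all function names) is a hypothetical illustration rather than anything taken from the source note.

```python
from typing import Dict, List, Set

# Hypothetical toy referential game: three faces, three possible utterances.
OBJECTS: List[str] = ["plain", "glasses", "glasses_hat"]

# LEXICON[utterance] = objects the utterance is literally true of (assumed).
LEXICON: Dict[str, Set[str]] = {
    "face": {"plain", "glasses", "glasses_hat"},
    "glasses": {"glasses", "glasses_hat"},
    "hat": {"glasses_hat"},
}


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative scores so they sum to 1 (left unchanged if all zero)."""
    total = sum(scores.values())
    if total == 0:
        return dict(scores)
    return {key: value / total for key, value in scores.items()}


def literal_listener(utterance: str) -> Dict[str, float]:
    """L0: uniform belief over the objects the utterance is literally true of."""
    return normalize(
        {obj: 1.0 if obj in LEXICON[utterance] else 0.0 for obj in OBJECTS}
    )


def pragmatic_speaker(obj: str, alpha: float = 1.0) -> Dict[str, float]:
    """S1: prefers utterances that lead the literal listener to the intended object."""
    return normalize(
        {utt: literal_listener(utt)[obj] ** alpha for utt in LEXICON}
    )


def pragmatic_listener(utterance: str, alpha: float = 1.0) -> Dict[str, float]:
    """L1: infers the object by reasoning about what the speaker would have said."""
    return normalize(
        {obj: pragmatic_speaker(obj, alpha)[utterance] for obj in OBJECTS}
    )


if __name__ == "__main__":
    print(literal_listener("glasses"))
    print(pragmatic_listener("glasses"))
```

In this toy setup the literal listener splits its belief evenly between the two faces with glasses, while the pragmatic listener leans toward the face with glasses only (about 0.69), because a speaker describing the other face would more likely have said "hat". That one-step anticipation of the other party's choices is the kind of forward-projecting reasoning the paragraph above attributes to belief-tracking models; a multi-turn version would need to carry these beliefs across turns, which this sketch does not attempt.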
What ties this together is that conversational organization turns out to be a learnable layer sitting on top of discourse competence, not an emergent byproduct of it. Why don't language models develop conversation maintenance skills? argues that maintenance (repair, topic hand-off) is social action that training never rewards, and Why do language models engage with conversational distractors? shows that fine-tuning on just 1,080 synthetic dialogues significantly improves topic resilience, so the gap reflects an absent training signal, not a capacity ceiling. Meanwhile, What semantic failures break dialogue coherence most realistically? uses Abstract Meaning Representation (AMR) to catch failures (contradiction, broken coreference, irrelevancy, disengagement) that live precisely at the seam between sentence-level structure and conversation-level flow, failures that text-surface analysis alone misses.
The thing you might not have expected: the corpus also offers a contrarian vote. Does structured artifact sharing outperform conversational coordination? (MetaGPT) finds that for multi-agent coordination, structured shared artifacts beat conversational exchange entirely — sometimes the cleanest way to make the two layers cooperate is to lift the organizing structure out of the conversation and into an explicit document. So the answer is yes, they can work together — and the highest-leverage designs either fuse them (hybrid structural+content models, bidirectional belief tracking) or deliberately separate the structural scaffolding from the conversational stream.
Sources (8 notes)
ChatGPT defaults to summarizing what was already said, while students use more forward-pointing structure that previews upcoming arguments. This reflects different reader models and may stem from how autoregressive generation works token by token.
LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, which leaves the user as the sole maintainer of the conversational scoreboard.
TRACE achieved 68% accuracy predicting dialogue success from structural features alone, nearly matching a 70% content-based baseline. A hybrid combining both reached 80%, suggesting that how agents communicate rivals what they say.
CRSA integrates rate-distortion theory with RSA to enable bidirectional belief tracking across dialogue turns. Demonstrated on referential games and doctor-patient dialogues, it captures progression from partial to shared understanding, providing the information-theoretic framework that token-level LLM systems lack.
Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off, which serve to sustain the relational interaction rather than to convey information. Language models don't develop these because training signals reward information prediction, not relational work.
Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.
Research using Abstract Meaning Representation identified four distinct incoherence types: contradiction, coreference inconsistency, irrelevancy, and decreased engagement. AMR-trained classifiers detect these semantic failures while text-level manipulations alone cannot.
MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.