Do language models segment events like human consensus does?
Can GPT-3 identify event boundaries in narrative text the way humans do? This matters because it could reveal whether language models and human cognition share similar predictive mechanisms for understanding continuous experience.
Humans perceive continuous experience as discrete events — "restaurant visits" and "train rides" — with identifiable boundaries. Studying event cognition requires these boundaries to be annotated, typically crowd-sourced from large behavioral samples. GPT-3, prompted with instructions similar to those given human participants, segments continuous narrative text into events that correlate significantly with human annotations. More strikingly, GPT-3's boundaries are closer to the human consensus (averaged across annotators) than boundaries from individual human annotators.
This is not just a practical finding about automating event annotation. It suggests a deeper parallel between next-token prediction and human event cognition. Event Segmentation Theory proposes that humans track ongoing events through predictive models that update at event boundaries — moments when prediction error spikes because the situation has changed. Next-token prediction in language models follows an analogous structure: the model continuously predicts what comes next, and event boundaries correspond to points of high predictive uncertainty.
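The analogy above suggests a simple operationalization: if event boundaries are points of high predictive uncertainty, they should show up as spikes in next-token surprisal. A minimal sketch of that idea, assuming per-token surprisal values (negative log-probabilities) have already been obtained from some language model — the numbers below are illustrative, not real model outputs:

```python
# Sketch: flag putative event boundaries where next-token surprisal spikes.
# Assumes `surprisal` holds -log p(token) values from a language model.
from statistics import mean, stdev

def boundary_indices(surprisal, k=1.5):
    """Return token positions whose surprisal exceeds mean + k * std."""
    mu, sigma = mean(surprisal), stdev(surprisal)
    return [i for i, s in enumerate(surprisal) if s > mu + k * sigma]

# Illustrative trace: the spike at position 5 marks a putative boundary
# (e.g. the narrative cuts from a restaurant visit to a train ride).
trace = [2.1, 1.8, 2.4, 2.0, 1.9, 7.5, 2.2, 2.3]
print(boundary_indices(trace))  # → [5]
```

The threshold `k` is a free parameter here; real studies would tune it against human annotations rather than fix it a priori.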
The "closer to consensus" finding has an elegant explanation: individual human annotators bring idiosyncratic biases (personal experience, attention fluctuations, interpretation differences). The consensus is obtained by averaging across annotators, canceling out individual noise. GPT-3, trained on massive text corpora, may have already averaged across the distributional regularities of many human writers' event descriptions — effectively pre-computing the consensus through training.
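The noise-averaging explanation can be checked with a toy simulation: model each annotator's boundary signal as a shared ground truth plus idiosyncratic Gaussian noise, and compare how well individuals versus their average recover the truth. All quantities below are illustrative assumptions, not data from the study:

```python
# Toy simulation of the "closer to consensus" effect: averaging across
# annotators cancels individual noise, so the consensus correlates with
# the shared signal better than a typical individual annotator does.
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

random.seed(0)
truth = [random.random() for _ in range(200)]            # shared event signal
annotators = [[t + random.gauss(0, 1.0) for t in truth]  # individual = truth + noise
              for _ in range(20)]
consensus = [sum(col) / len(col) for col in zip(*annotators)]

r_individual = sum(pearson(a, truth) for a in annotators) / len(annotators)
r_consensus = pearson(consensus, truth)
print(f"mean individual r = {r_individual:.2f}, consensus r = {r_consensus:.2f}")
```

The consensus correlation reliably exceeds the mean individual correlation, which is the statistical pattern a model that has implicitly averaged over many writers' event descriptions would also exhibit.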
However, this may also reflect a limitation. As the related note "Why do language models fail at communicative optimization?" argues, the event segmentation capability may be a statistical regularity (event boundaries correspond to distributional shifts in text) rather than genuine event understanding. A model could identify event boundaries purely from lexical and structural cues without any understanding of what events are.
Source: Cognitive Models Latent
Related concepts in this collection
- What three layers must discourse systems actually track?
  Grosz and Sidner's 1986 framework proposes that discourse requires simultaneously tracking linguistic segments, speaker purposes, and salient objects. Understanding why all three are necessary helps explain where current AI systems structurally fail.
  Relation: event segmentation adds a fourth potential component, temporal/narrative event structure.
- Why do language models fail at communicative optimization?
  LLMs excel at learning surface statistical patterns from text but struggle with deeper principles of how language achieves efficient communication. What distinguishes these two types of linguistic knowledge?
  Relation: event segmentation may be a statistical regularity rather than genuine event cognition.
- Can AI systems learn social norms without embodied experience?
  Large language models exceed individual human accuracy at predicting collective social appropriateness judgments. Does this reveal that embodied experience is unnecessary for cultural competence, or do systematic AI failures point to limits of statistical learning?
  Relation: parallel pattern, in which LLMs approximate collective human judgment better than individual humans.
- What semantic failures break dialogue coherence most realistically?
  Can we distinguish distinct types of incoherence by manipulating semantic structure rather than surface text? This matters because text-level evaluations miss the semantic failures that actually occur in dialogue systems.
  Relation: event segmentation provides temporal scaffolding for coherence; correctly segmented events make contradictions and coreference inconsistencies detectable within and across segments.
- Can tracking dialogue dimensions simultaneously reveal hidden conversation patterns?
  Does encoding linguistic complexity, emotion, topics, and relevance as parallel temporal streams expose emergent patterns that traditional statistical analysis misses? This matters because conversation success may depend on interactions between dimensions, not individual features alone.
  Relation: event segmentation produces distinct temporal signatures in Conversational DNA's multi-dimensional tracking; segment boundaries correspond to coordinated transitions in emotional trajectory, topic coherence, and linguistic complexity.
Original note title: llms segment narrative events closer to human consensus than individual human annotators