Why do AI systems miss jokes and wordplay so consistently?
Exploring whether AI's literal reading of language stems from transformers processing tokens in parallel rather than through the selective frame-activation humans rely on. Understanding this gap could reveal which cognitive operations current architectures lack.
The transformer architecture processes a sequence of tokens through attention layers that compute relations across all token pairs. Information about the words is integrated, but the integration is parallel and additive — every token influences every other in proportion to the attention weights. There is no cognitive operation that suppresses some attention paths in order to surface the frame that holds a subset of tokens together. The mechanism does not do selective-resonance; it does weighted-aggregation.
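To make "weighted-aggregation" concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The toy shapes, function names, and random embeddings are illustrative assumptions, not anything from the source; the point is only that every output row is a weighted sum over all value vectors, so low-relevance tokens are down-weighted but never removed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention over a toy sequence.

    Each output row is a weighted sum over ALL value vectors: nothing is
    suppressed, so every token contributes in proportion to its weight.
    This is the weighted-aggregation the note contrasts with selective-resonance.
    """
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (n, n); each row sums to 1
    return weights @ V, weights

# Toy example: 4 "tokens" with random 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, w = attention(X, X, X)
print(w.round(2))  # no zeros: every token influences every other
```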
This explains a recurring AI failure pattern. Given material that contains a frame activated by some words but not others, AI tends to read the material literally — taking each word at its compositional value rather than catching the frame the subset activates. The bullseye example illustrates: given "bullseye" applied to a design with a dot, a cover, and an arrow through it, AI reads "bullseye" as compliment-metaphor and misses the archery frame three of the four words activate. The miss is structural, not a knowledge gap. AI knows what "bullseye" is, knows what "arrow" is, knows what "dot" is. What it does not do is select these three for frame-activation while suppressing "cover."
This generalizes beyond wordplay. The same mechanism underlies AI difficulties with jokes (the punchline activates a frame that recontextualizes the setup), with poetry (image-clusters activate frames the literal words do not), with rhetoric (where a frame is built from selective material across a passage). Each of these depends on selective-resonance — the operation transformers do not perform. The miss is not "AI lacks world knowledge"; it is "AI lacks the selective-suppression operation that frame-activation requires."
The human-side companion is "Does the mind selectively activate frames from only some words?" Together the two claims locate the difference precisely: not that AI lacks data or context, but that the cognitive operation human meaning-making relies on is not the operation transformers perform.
The strongest counterargument: better attention mechanisms, finer-grained attention heads, and explicit frame-extraction layers could close the gap. Possible but not yet evident. The gap appears even in the largest models with the most sophisticated attention, which suggests the operation needed is not just better attention but a different operation. Selective frame-activation may require something architecturally distinct from attention-as-weighted-aggregation.
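For concreteness, one hedged illustration of what a "selective suppression" step might look like if bolted onto attention, reusing the softmax from the sketch above: plain top-k sparse attention that zeroes out all but the k strongest links per token. The k parameter and the hard cutoff are illustrative assumptions; this variant selects by raw relevance score, not by a shared frame, which is exactly why it does not settle whether frame-activation can be obtained this way.

```python
def topk_attention(Q, K, V, k=2):
    """Hypothetical 'selective' variant: keep only the k strongest links per
    token and suppress the rest to exactly zero before normalizing.

    This is ordinary top-k sparse attention, shown only to make the contrast
    concrete; it hard-masks by relevance score, not by a shared frame.
    Reuses numpy as np and softmax() from the sketch above.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Threshold at each row's k-th largest score (ties may keep a few extra).
    kth = np.sort(scores, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    weights = softmax(masked, axis=-1)  # suppressed tokens get exactly zero weight
    return weights @ V, weights
```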
Source: Making Sense - brief for co-authored essay on language
Related concepts in this collection
- Does the mind selectively activate frames from only some words?
  When we understand wordplay or jokes, do we activate a frame from a subset of available words while suppressing nearby but frame-unrelated words? This matters because it reveals how meaning-making differs from how AI processes language.
  (companion human-side claim)
- How do readers actually build meaning from words?
  Does meaning come from adding up word definitions, or from detecting which words activate the same mental frame together? This explores whether composition or resonance better describes how we make sense of language.
  (the broader theoretical claim)
- Why don't conversational AI systems mirror their users' word choices?
  Explores whether current dialogue models exhibit lexical entrainment—the human tendency to align vocabulary with conversation partners—and what's needed to bridge this gap in AI communication.
  (adjacent failure mode in AI's handling of conversational meaning)
Original note title: AI reads words literally, one at a time, missing the frame that multiple words activate together