Why do AI systems miss jokes and wordplay so consistently?
Exploring whether AI's literal reading of language stems from transformers processing tokens in parallel rather than through the selective frame-activation humans rely on. Understanding this gap could reveal which cognitive operations current architectures lack.
The transformer architecture processes a sequence of tokens through attention layers that compute relations across all token pairs. Information about the words is integrated, but the integration is parallel and additive — every token influences every other in proportion to the attention weights. There is no cognitive operation that suppresses some attention paths in order to surface the frame that holds a subset of tokens together. The mechanism does not do selective-resonance; it does weighted-aggregation.
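To make "weighted-aggregation" concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The toy shapes, function names, and random embeddings are illustrative assumptions, not anything from the source; the point is only that every output row is a weighted sum over all value vectors, so low-relevance tokens are down-weighted but never removed.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention over a toy sequence.

    Each output row is a weighted sum over ALL value vectors: nothing is
    suppressed, so every token contributes in proportion to its weight.
    This is the weighted-aggregation the note contrasts with selective-resonance.
    """
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (n, n); each row sums to 1
    return weights @ V, weights

# Toy example: 4 "tokens" with random 8-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out, w = attention(X, X, X)
print(w.round(2))  # no zeros: every token influences every other
```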
This explains a recurring AI failure pattern. Given material that contains a frame activated by some words but not others, AI tends to read the material literally — taking each word at its compositional value rather than catching the frame the subset activates. The bullseye example illustrates: given "bullseye" applied to a design with a dot, a cover, and an arrow through it, AI reads "bullseye" as compliment-metaphor and misses the archery frame three of the four words activate. The miss is structural, not a knowledge gap. AI knows what "bullseye" is, knows what "arrow" is, knows what "dot" is. What it does not do is select these three for frame-activation while suppressing "cover."
This generalizes beyond wordplay. The same mechanism underlies AI difficulties with jokes (the punchline activates a frame that recontextualizes the setup), with poetry (image-clusters activate frames the literal words do not), with rhetoric (where a frame is built from selective material across a passage). Each of these depends on selective-resonance — the operation transformers do not perform. The miss is not "AI lacks world knowledge"; it is "AI lacks the selective-suppression operation that frame-activation requires."
The human-side companion is "Does the mind selectively activate frames from only some words?" Together the two claims locate the difference precisely: not that AI lacks data or context, but that the cognitive operation human meaning-making relies on is not the operation transformers perform.
The strongest counterargument: better attention mechanisms, finer-grained attention heads, and explicit frame-extraction layers could close the gap. Possible but not yet evident. The gap appears even in the largest models with the most sophisticated attention, which suggests the operation needed is not just better attention but a different operation. Selective frame-activation may require something architecturally distinct from attention-as-weighted-aggregation.
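For concreteness, one hedged illustration of what a "selective suppression" step might look like if bolted onto attention, reusing the softmax from the sketch above: plain top-k sparse attention that zeroes out all but the k strongest links per token. The k parameter and the hard cutoff are illustrative assumptions; this variant selects by raw relevance score, not by a shared frame, which is exactly why it does not settle whether frame-activation can be obtained this way.

```python
def topk_attention(Q, K, V, k=2):
    """Hypothetical 'selective' variant: keep only the k strongest links per
    token and suppress the rest to exactly zero before normalizing.

    This is ordinary top-k sparse attention, shown only to make the contrast
    concrete; it hard-masks by relevance score, not by a shared frame.
    Reuses numpy as np and softmax() from the sketch above.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Threshold at each row's k-th largest score (ties may keep a few extra).
    kth = np.sort(scores, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    weights = softmax(masked, axis=-1)  # suppressed tokens get exactly zero weight
    return weights @ V, weights
```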
Source: Making Sense - brief for co-authored essay on language
Related concepts in this collection
- Does the mind selectively activate frames from only some words?
  When we understand wordplay or jokes, do we activate a frame from a subset of available words while suppressing nearby but frame-unrelated words? This matters because it reveals how meaning-making differs from how AI processes language.
  (companion human-side claim)
- How do readers actually build meaning from words?
  Does meaning come from adding up word definitions, or from detecting which words activate the same mental frame together? This explores whether composition or resonance better describes how we make sense of language.
  (the broader theoretical claim)
- Why don't conversational AI systems mirror their users' word choices?
  Explores whether current dialogue models exhibit lexical entrainment—the human tendency to align vocabulary with conversation partners—and what's needed to bridge this gap in AI communication.
  (adjacent failure mode in AI's handling of conversational meaning)
Original note title: AI reads words literally, one at a time, missing the frame that multiple words activate together