Do large language models resemble humans in language use?
Regularities in language use range from phonology to pragmatics. For example, people associate different sounds with different referents (e.g., Köhler, 1929), automatically reinterpret implausible sentences (e.g., Gibson et al., 2013), and expect demographically appropriate content from speakers (e.g., Van Berkum et al., 2008). Do LLMs share these regularities in language use? Piantadosi (2023) pointed out that LLMs integrate syntax and semantics (i.e., all aspects of usage are represented in a single vector space), so other humanlike regularities in language use might emerge along with grammaticality and coherence. Supporting the hypothesis that humans and LLMs have underlying similarities, representations from intermediate layers of LLMs (rather than initial layers, which are not contextualized, or output layers) are highly predictive of activation in language-selective brain regions when humans and LLMs process the same passage of text (Caucheteux & King, 2022; Schrimpf et al., 2021). Determining whether LLMs have developed humanlike regularities will not only gauge the success of computational research into natural language processing but will also be important for cognitive scientists. For instance, as with universal grammar, it is a matter of debate whether regularities in language processing arise from innate constraints or from statistical learning. Given that LLMs have no built-in linguistic rules, they allow us to test which regularities in language use can be recovered from statistical patterns in language, at least in principle, given an abundance of training data.
Sounds: sound-shape association
When guessing whether a nonword refers to a round or spiky shape, both LLMs gave more round judgements for nonwords judged by humans to sound round (e.g., maluma) than spiky (e.g., takete).
Sounds: sound-gender association
When completing a sentence preamble containing a novel name (e.g., Although Pelcrad/Pelcra was sick...), both LLMs used more feminine pronouns (i.e., she/her/hers) to refer to the novel name ending in a vowel than a consonant.
Words: word length and predictivity
Humans more often choose the shorter of two words with very similar meanings (e.g., mathematics vs math) when completing a sentence that is predictive of the word’s meaning (e.g., Susan was very bad at algebra, so she hated…) than when completing a neutral sentence (e.g., Susan introduced herself to me as someone who hated…). Neither LLM exhibited this tendency.
Words: word meaning priming
After an ambiguous cue word (e.g., post), both LLMs supplied an associate of a particular meaning of that word (e.g., position, rather than mail) more often when they had earlier read a sentence that used that meaning of the ambiguous word (e.g., The man accepted the post in the accountancy firm) than when they had read a sentence containing a synonym (e.g., The man accepted the job in the accountancy firm) or when they had not read a sentence containing that meaning.
Syntax: structural priming
When completing a prime preamble (e.g., The racing driver gave the torn overall ...) and then a target preamble (e.g., The patient showed ...), both ChatGPT and Vicuna tended to use the same syntactic structure in the prime and the target. For example, they were more likely to complete a target preamble into a prepositional-object (PO) dative sentence (e.g., The patient showed his hand to the nurse) after completing a prime preamble into a PO sentence (e.g., The racing driver gave the torn overall to his mechanic) than after completing a prime preamble into a double-object (DO) dative sentence (e.g., The racing driver gave the helpful mechanic a wrench). In both models, the priming effect was enhanced when the prime and the target had the same verb (e.g., showed in both) compared to different verbs (e.g., gave vs. showed).
Syntax: syntactic ambiguity resolution
Humans more often interpret a syntactically ambiguous phrase such as with the rifle in The hunter shot the dangerous poacher with the rifle as modifying the noun (i.e., the poacher had the rifle) rather than the verb (i.e., the hunter used the rifle) when the discourse has introduced multiple potential referents rather than a single referent for the dangerous poacher (e.g., There was a hunter and two poachers / a poacher). Neither LLM exhibited this tendency.
Meaning: implausible sentence interpretation
After reading an implausible sentence in a DO structure (e.g., The mother gave the candle the daughter) or in a PO structure (e.g., The mother gave the daughter to the candle), ChatGPT, but not Vicuna, was more likely to interpret the implausible DO sentence nonliterally than the implausible PO sentence (e.g., treating the daughter as the recipient of the candle).
Meaning: semantic illusion
ChatGPT, but not Vicuna, noticed fewer errors when sentences contained incongruent words that were semantically close to the congruent words (e.g., Snoopy is the black and white cat in what famous Charles Schulz comic strip?) than when they were semantically distant (e.g., Snoopy is the black and white mouse in what famous Charles Schulz comic strip?; in fact, Snoopy is a dog).
Discourse: implicit causality
When completing a sentence preamble (Gary scared/feared Anna because ...) into a full sentence, both LLMs were more likely to attribute the causality of the preamble event to the object rather than the subject (e.g., Anna rather than Gary) when the preamble had a stimulus-experiencer verb with the subject serving as the stimulus and the object serving as the experiencer (e.g., Gary scared Anna because he was violent) than when the preamble had an experiencer-stimulus verb (e.g., Gary feared Anna because she was violent).
Discourse: drawing inferences
ChatGPT, but not Vicuna, was more likely to make inferences that connect two pieces of information (e.g., Sharon stepped on glass. She cried out for help.) than to make inferences that elaborate on a single piece of information (e.g., Sharon stepped on glass. She was looking for a watch.), in response to a question (e.g., Did she cut her foot?).
Interlocutor sensitivity: word meaning access
When supplying an associate to a word with different meanings in different dialects (e.g., bonnet meaning “car-part” in British English but “hat” in American English), both LLMs were more likely to access the American English meaning when the interlocutor self-identified as an American English speaker than as a British English speaker.
Interlocutor sensitivity: lexical retrieval
When asked to supply a word or phrase matching a provided definition (e.g., a housing unit common in big cities that occupies part of a single level in a building block), both LLMs were more likely to retrieve an American expression rather than a British one (e.g., apartment vs. flat) when the interlocutor self-identified as an American English speaker than as a British English speaker.