Why can't language models reverse learned facts?
Language models trained on directional statements like "A is B" often fail to answer the reverse query. This note explores why symmetric relations aren't automatically learned during training, even though both directions of such relations appear throughout the training data.
If a model is trained on "Valentina Tereshkova was the first woman to travel to space," it will not automatically answer "Who was the first woman to travel to space?" with "Tereshkova." Moreover, the model assigns the correct name no higher a likelihood than a random name. Training encodes A→B but not B→A.
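This is straightforward to measure: score candidate completions of the reversed query and compare their log-likelihoods. Below is a minimal sketch, assuming a HuggingFace causal LM ("gpt2" as a stand-in) and a made-up control name; a faithful replication would fine-tune on synthetic facts so the fact is guaranteed to appear only in the forward direction during training.

```python
# Minimal sketch of the likelihood comparison, assuming a HuggingFace
# causal LM. "gpt2" is an illustrative stand-in; the reversal curse is
# measured on models that saw the fact only in the forward direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probabilities the model assigns to the tokens of
    `completion` when it follows `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    # Logits at position i predict the token at position i + 1, so the
    # completion tokens are predicted by positions prompt_len-1 .. end-2.
    total = 0.0
    for pos in range(prompt_len - 1, full_ids.shape[1] - 1):
        total += log_probs[0, pos, full_ids[0, pos + 1]].item()
    return total

reverse_query = "The first woman to travel to space was"
print(completion_logprob(reverse_query, " Valentina Tereshkova"))
print(completion_logprob(reverse_query, " Margaret Hughes"))  # made-up control
# Under the reversal curse, these two scores are statistically
# indistinguishable for facts trained only as A -> B.
```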
This is not a failure of logical deduction: GPT-4 given "A is B" in context can infer "B is A" perfectly well. The failure is one of meta-learning during training. The model never extracts the general principle that identity is symmetric, even though the training data is full of relations stated in both directions.
The practical implications are significant. Knowledge retrieval from LLMs is directional — the model's ability to recall a fact depends on the query direction matching the training data format. This means coverage of world knowledge is systematically incomplete in a non-obvious way: the model may "know" a fact by one measure (can state A→B) but not by another (cannot retrieve A given B).
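The same asymmetry shows up when one fact is queried from both ends. A sketch under the same assumptions as above (illustrative prompts, "gpt2" as a stand-in for a model that saw the fact only in the forward direction):

```python
# Contrast forward recall with reverse retrieval for the same fact.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def greedy_continuation(prompt: str, max_new_tokens: int = 8) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, ids.shape[1]:])

# Forward: matches the direction the fact was trained in.
print(greedy_continuation("Valentina Tereshkova was the first woman to"))

# Reverse: the same fact queried from the other end. For facts seen only
# forward during training, the completion is no better than chance.
print(greedy_continuation("The first woman to travel to space was"))
```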
This connects to "Does training data format shape reasoning strategy more than domain?": the format in which information was presented during training determines which retrieval patterns are available. The reversal curse is a specific instance: the sequential format of autoregressive training creates directional associations that don't generalize to their logical inverses.
The reversal curse also challenges the assumption that LLMs develop internal representations that abstract away from surface form. If a symmetric relation were truly represented internally, both directions would be accessible. The directional failure suggests the representation is closer to an associative pattern than to a relational structure.
Source: Flaws
Related concepts in this collection
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  Connection: training format determines retrieval patterns; the reversal curse is a specific directional failure of format-bound learning.
- Why do LLMs handle causal reasoning better than temporal reasoning?
  Exploring whether language models perform asymmetrically on different discourse relations, and what training data patterns might explain the gap between causal and temporal reasoning abilities.
  Connection: another case where the training data distribution shapes which reasoning directions succeed.
- Do large language models reason symbolically or semantically?
  Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.
  Connection: consistent with the reversal curse: the symbolic rule that identity is symmetric is not learned, only the one-directional semantic association.
Original note title: the reversal curse — LLMs trained on A is B fail to learn B is A