The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
We expose a surprising failure of generalization in auto-regressive large language models (LLMs). If a model is trained on a sentence of the form “A is B”, it will not automatically generalize to the reverse direction “B is A”. This is the Reversal Curse. For instance, if a model is trained on “Valentina Tereshkova was the first woman to travel to space”, it will not automatically be able to answer the question, “Who was the first woman to travel to space?”. Moreover, the likelihood of the correct answer (“Valentina Tereshkova”) will not be higher than for a random name. Thus, models do not generalize a prevalent pattern in their training set: if “A is B” occurs, “B is A” is more likely to occur.
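The likelihood claim suggests a simple probe: compare the total log-probability a causal LM assigns to the correct name versus a random-name baseline after the reversed question. The sketch below is not the paper's evaluation code; the model `gpt2`, the prompt phrasing, and the helper `completion_logprob` are illustrative assumptions.

```python
# Minimal sketch (assumed model, prompt, and helper; not the authors' code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper fine-tunes larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Total log-probability the model assigns to `completion` given `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position i predict token i + 1, so shift by one position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(
        log_probs[pos - 1, full_ids[0, pos]].item()
        for pos in range(prompt_len, full_ids.shape[1])
    )

question = "Q: Who was the first woman to travel to space?\nA:"
print(completion_logprob(question, " Valentina Tereshkova"))  # correct answer
print(completion_logprob(question, " Mary Johnson"))          # random-name baseline
```

Under the Reversal Curse, a model trained only on the “A is B” ordering would show no advantage for the first score over the second.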
Why does the Reversal Curse matter? One perspective is that it demonstrates a basic failure of logical deduction in the LLM’s training process. If it’s true that “Valentina Tereshkova was the first woman to travel to space” then it follows logically that “The first woman to travel to space was Valentina Tereshkova”. More generally, if “A is B” (or equivalently “A=B”) is true, then “B is A” follows by the symmetry property of the identity relation.
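Spelled out in first-order logic with equality, the step from “A is B” to “B is A” is just the symmetry axiom for identity:

```latex
% Symmetry of the identity relation: for all a and b,
% if a = b then b = a.
\forall a \,\forall b \; (a = b \rightarrow b = a)
```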
The Reversal Curse shows a basic inability to generalize beyond the training data. Moreover, this failure is not explained by the LLM failing to understand logical deduction: if an LLM such as GPT-4 is given “A is B” in its context window, it can infer “B is A” perfectly well. That said, relating the Reversal Curse to logical deduction is a simplification of the full picture. It’s not possible to test directly whether an LLM has deduced “B is A” after being trained on “A is B”, because LLMs are trained to predict what humans would write, not what is true (Lin et al., 2022). So even if an LLM had inferred “B is A”, it might not “tell us” when prompted. Nevertheless, the Reversal Curse demonstrates a failure of meta-learning: if “A is B” occurs in a training corpus, “B is A” is more likely to occur as well, yet training on the former does not raise the model’s probability of the latter.
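The in-context version of the claim is easy to check directly. A minimal sketch, with the caveat that the stand-in model and prompt format are assumptions (the paper reports this in-context ability for a capable model like GPT-4):

```python
# Sketch of the in-context contrast ("gpt2" is a runnable stand-in only;
# the observation in the text is about GPT-4).
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = (
    "Valentina Tereshkova was the first woman to travel to space.\n"
    "Q: Who was the first woman to travel to space?\nA:"
)
# With "A is B" present in the context window, reading off "B is A" is easy.
print(generator(prompt, max_new_tokens=10)[0]["generated_text"])
```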