Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
We evaluate how well Large Language Models (LLMs) latently recall and compose facts to answer multi-hop queries like “In the year Scarlett Johansson was born, the Summer Olympics were hosted in the country of”. One major challenge in evaluating this ability is that LLMs may have developed shortcuts from encountering the head entity “Scarlett Johansson” and the answer entity “United States” in the same training sequences, or they may merely guess the answer based on frequency-based priors. To prevent shortcuts, we exclude test queries where the head and answer entities co-appear in pretraining corpora. Through careful selection of relations and facts and systematic removal of cases where models might guess answers or exploit partial matches, we construct an evaluation dataset SOCRATES (SHORTCUT-FREE LATENT REASONING). We observe that LLMs demonstrate promising latent multi-hop reasoning abilities without exploiting shortcuts, but only for certain types of queries. For queries requiring latent recall of countries as the intermediate answer, the best models achieve 80% latent composability, but this drops to just 5% for the recall of years. Comparisons with Chain-of-Thought composability highlight a significant gap between the ability of models to reason latently versus explicitly. Analysis reveals that latent representations of the intermediate answer are constructed more often in queries with higher latent composability, and shows the emergence of latent multi-hop reasoning during pretraining.
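To make the shortcut-removal criterion concrete, below is a minimal sketch, assuming access to an iterable of pretraining sequences and alias lists for each entity; the field names and helper functions are illustrative and do not represent the actual construction pipeline.

```python
from typing import Dict, Iterable, List


def co_appears(sequences: Iterable[str],
               head_aliases: List[str],
               answer_aliases: List[str]) -> bool:
    """True if any head alias and any answer alias occur in the same pretraining sequence."""
    for seq in sequences:
        if any(h in seq for h in head_aliases) and any(a in seq for a in answer_aliases):
            return True
    return False


def remove_shortcut_queries(queries: List[Dict], sequences: List[str]) -> List[Dict]:
    """Exclude multi-hop test queries whose head and answer entities co-appear in the corpus."""
    return [
        q for q in queries
        if not co_appears(sequences, q["head_aliases"], q["answer_aliases"])
    ]
```

In practice, such co-occurrence checks would be run against an indexed corpus search rather than the linear scan shown here.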
Latent multi-hop reasoning in Large Language Models (LLMs) is the ability to latently recall and compose single-hop facts to answer multi-hop queries like “In the year Scarlett Johansson was born, the Summer Olympics were hosted in the country of”. This ability can serve as a measure of how well factual knowledge in LLMs can be localized and controlled, as it can signal the learning of a compressed representation of facts and their latent composition (Yang et al., 2024b). It would also lend more support to the locate-then-edit and unlearning paradigms for LLMs (Meng et al., 2022; Hong et al., 2024). For instance, if complex facts are learned and recalled redundantly, edits with only single-hop facts would not propagate to the relevant multi-hop facts (Onoe et al., 2023; Zhong et al., 2023; Cohen et al., 2024; Ju et al., 2024). In addition, the ability to provide accurate answers without explicit Chain-of-Thought (CoT) generation (Kojima et al., 2022) could reduce inference costs. At the same time, whether LLMs can spontaneously develop latent reasoning abilities during pretraining is of interest from a safety perspective, as latent reasoning is less visible and harder to monitor given the opaque computations in LLMs (Berglund et al., 2023; Treutlein et al., 2024; Chan et al., 2024). Taken together, these incentives raise the question: How well do today’s widely used LLMs perform latent multi-hop reasoning over factual knowledge?
Moreover, if there are certain cases where today’s LLMs show robust latent reasoning, we could study these cases further to identify the underlying causes that make latent reasoning emerge during pretraining.
Our resulting dataset, named SOCRATES (SHORTCUT-FREE LATENT REASONING), consists of 7,232 pairs of single-hop and multi-hop queries spanning 17 types of relation compositions and 4 types of bridge entities. It can be used to analyze any LLM that is neither directly trained on synthetic data generated from a knowledge graph nor performing CoT reasoning behind an API.
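For illustration only, a single entry in such a dataset might pair the multi-hop query with its constituent single-hop queries and metadata about the relation composition and bridge entity type; the field names below are hypothetical and do not reflect the released schema.

```python
# A hypothetical SOCRATES-style entry, built from the running example in the paper.
example_entry = {
    "multi_hop_query": "In the year Scarlett Johansson was born, the Summer "
                       "Olympics were hosted in the country of",
    "single_hop_queries": [
        "The year Scarlett Johansson was born is",                     # head -> bridge
        "In 1984, the Summer Olympics were hosted in the country of",  # bridge -> answer
    ],
    "bridge_entity": "1984",
    "bridge_entity_type": "year",          # one of the 4 bridge entity types
    "relation_composition": "year_of_birth + olympics_host_country",  # one of 17 composition types
    "answer": "United States",
}
```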
Our results for 41 LLMs from 9 families reveal that there are successful cases of latent multi-hop reasoning, but performance varies substantially with the type of bridge entity that connects the facts. Notably, state-of-the-art models demonstrate strong latent composability of over 80% when the bridge entity is a country, yet this drops to around 6% for year-based queries. This finding underscores the importance of considering the distribution of relation composition types when evaluating LLMs’ latent reasoning abilities. Models that know more single-hop facts tend to reason better latently, and this ability improves marginally with model scale. Comparisons with CoT composability highlight opportunities for improvement: CoT composability is significantly higher than latent composability, increases more effectively with the number of known single-hop facts and with model size, and remains more consistent across bridge entity types. Additional analysis shows that the latent representation of the bridge entity is constructed notably more often for queries with higher latent composability, and reveals the emergence of latent multi-hop reasoning during pretraining.
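As a sketch of how the latent and CoT composability measures discussed above might be operationalized, assuming composability is the fraction of multi-hop queries answered correctly among those whose constituent single-hop facts the model already recalls correctly; the record fields and function names are hypothetical.

```python
from typing import Dict, List


def composability(records: List[Dict], multi_hop_key: str) -> float:
    """Fraction of multi-hop queries answered correctly, restricted to queries
    where both constituent single-hop facts are recalled correctly."""
    eligible = [r for r in records
                if r["single_hop_1_correct"] and r["single_hop_2_correct"]]
    if not eligible:
        return 0.0
    return sum(r[multi_hop_key] for r in eligible) / len(eligible)


# Latent composability: multi-hop answer produced directly, without CoT.
# CoT composability: multi-hop answer produced after explicit chain-of-thought.
# latent = composability(records, "multi_hop_correct_latent")
# cot = composability(records, "multi_hop_correct_cot")
```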