What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models
Foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much as Kepler's predictions of planetary motion later led to the discovery of Newtonian mechanics. However, evaluating whether these models truly capture deeper structure remains a challenge. We develop a technique for evaluating foundation models that examines how they adapt to synthetic datasets generated from some postulated world model. Our technique measures whether the foundation model's inductive bias aligns with the world model, and so we refer to it as an inductive bias probe. Across multiple domains, we find that foundation models can excel at their training tasks yet fail to develop inductive biases toward the underlying world model when adapted to new tasks. In particular, we find that foundation models trained on orbital trajectories consistently fail to apply Newtonian mechanics when adapted to new physics tasks. Further analysis reveals that these models behave as if they develop task-specific heuristics that fail to generalize.
- Introduction
The promise of foundation models relies on a central presumption: that learning to predict sequences can uncover deeper truths, or, optimistically, even a world model. While this idea is new in one sense, it is old in another. Hundreds of years ago, astronomers like Kepler discovered geometric patterns that could pinpoint the future locations of planets in the night sky. Newton would later build on this progress to develop Newtonian mechanics, fundamental laws that could not only predict the movement of planets but also explain physical phenomena across the universe (Koestler, 1959; Gingerich, 2004). This path, from predicting sequences to understanding the deeper mechanisms that underlie them, is not unique to physics. In biology, animal breeders noticed patterns in the traits of offspring long before their predictive insights inspired Mendel to develop a theory of genetics.

How would we know if foundation models have also made the leap from making accurate predictions to developing reliable world models? This paper develops a framework for answering this question. Specifically, we create a procedure that, given a foundation model and a postulated world model, tests whether the foundation model has learned that world model. We call this technique an inductive bias probe, and it is built on a simple insight: just as scientists use world models to make inferences from small amounts of data, the implicit world model of a foundation model is revealed by how it extrapolates from small amounts of information. In short, the inductive bias of a foundation model reveals its world model.
We first demonstrate this procedure using an example from physics. Specifically, we aim to replicate Kepler’s and Newton’s experiments, albeit replacing the physicist with a foundation model of orbital mechanics. Much like Kepler, the model is able to predict orbital trajectories, even for solar systems it has not seen.
What would it mean for this foundation model's inductive bias to be toward Newtonian mechanics? We demonstrate one tangible way to test this: we fine-tune the foundation model on a small dataset where the output is exactly the force vector (a cornerstone of Newtonian mechanics) at each point in the trajectory. If the foundation model has learned Newtonian mechanics, it should have an inductive bias toward these force vectors. In contrast, Figure 1 shows that the model produces poor force-vector predictions. More strikingly, when we perform this exercise at a larger scale across many solar systems, the law of gravity the model uses to generalize bears no resemblance to Newton's law (Table 1).
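To make the fine-tuning target concrete, here is a minimal sketch of how such force-vector labels can be computed under Newton's law of gravitation. The trajectory points, masses, and data layout below are illustrative assumptions, not the paper's data format.

```python
import numpy as np

G = 6.674e-11  # gravitational constant (SI units)

def newtonian_force(pos_a, pos_b, mass_a, mass_b):
    """Force exerted on body A by body B under Newton's law of gravitation."""
    r = pos_b - pos_a                     # displacement from A to B
    dist = np.linalg.norm(r)
    magnitude = G * mass_a * mass_b / dist**2
    return magnitude * r / dist           # attractive: points from A toward B

# Labels for the small fine-tuning dataset: the force vector at each
# point of an (illustrative) two-body trajectory.
positions_a = np.array([[0.0, 0.0], [0.1, 0.0]])   # body A over two steps
positions_b = np.array([[1.0, 0.0], [1.0, 0.1]])   # body B over two steps
labels = np.array([
    newtonian_force(pa, pb, mass_a=5.97e24, mass_b=1.99e30)
    for pa, pb in zip(positions_a, positions_b)
])
```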
Taken together, our results provide a direction for understanding the deficiencies of foundation models: if a model's inductive bias is not toward a known model of reality, what is it toward? We explore this question by examining whether these foundation models have alternative inductive biases. Our analysis reveals that these models instead behave as if they develop task-specific heuristics that fail to generalize. For physics, rather than learning one universal physical law, the foundation model applies different, seemingly nonsensical laws depending on the task it is applied to. In the lattice and Othello domains, models have an inductive bias toward the set of legal next tokens (e.g., a board's legal next moves) rather than toward the world model itself.
- Conclusion
The promise of foundation models is that sequence prediction can uncover a deeper understanding of underlying mechanisms. We develop a framework for evaluating whether a foundation model has learned a postulated world model by measuring its inductive biases when transferring to new tasks. Our empirical results reveal that while many sequence models excel at next-token prediction, they often have limited inductive bias toward genuine world models. Rather than learning coherent world models, these models may be relying on coarsened or non-parsimonious state representations.
2.1. Comparing foundation models to world models
There is a challenge in defining what it means for a foundation model to recover a world model: foundation models and world models operate in different spaces. A foundation model takes a dataset and produces predictions for new inputs, whereas a world model describes the state structure implicit in that data.
One approach would be to mechanistically probe the foundation model, e.g., by comparing its internal representations to the postulated states of the world model. However, understanding the internal mechanisms of large models is challenging (Olah, 2022), and even when it succeeds, it may not reflect how a model actually behaves on new data (Casper et al., 2023). Another approach is to study the model's behavior statically, on a single task (Toshniwal et al., 2022; Vafa et al., 2024), but this does not capture how foundation models are used in the real world: as tools for new tasks.
We take a different approach, motivated by the no-free-lunch theorem (Wolpert, 1996). Loosely speaking, the no-free-lunch theorem states that no learning algorithm can perform better than another on average if any function could have generated the data it is applied to. Given limited data, learning algorithms must extrapolate to unseen inputs, and if any underlying function is possible, every such extrapolation is equally good on average. This means that every learning algorithm is better suited to some collection of possible functions: those it tends to produce when extrapolating from limited data. The functions that a learning algorithm tends to learn represent its inductive bias.
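For reference, a minimal statement of this result in our own notation (a paraphrase, not Wolpert's exact formulation): for any two learning algorithms $A_1$ and $A_2$ given the same training set $D$, the off-training-set error, averaged uniformly over all possible target functions $f$, is identical.

```latex
\sum_{f} \mathbb{E}\!\left[ \mathrm{err}_{\mathrm{OTS}}(A_1, f, D) \right]
\;=\;
\sum_{f} \mathbb{E}\!\left[ \mathrm{err}_{\mathrm{OTS}}(A_2, f, D) \right].
```

Any advantage an algorithm enjoys must therefore come from restricting attention to a particular class of functions; that class is its inductive bias.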
The idea of inductive bias offers a connection between foundation models and world models. A world model is a restriction on the possible functions from inputs to outputs: only those that obey its state structure are possible. Consequently, a foundation model that has learned a postulated world model should have an inductive bias towards functions that obey the world model’s state structure. For example, physicists may train a foundation model on sequences of planetary orbits. Since planetary orbits obey Newtonian mechanics, they might hope the model has an inductive bias toward functions of Newtonian mechanics (e.g. predicting the force vector between two planets).
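One way to make this restriction concrete, in our own notation rather than the paper's formal definitions: let $\phi$ map an input sequence $x$ to its world-model state. A function $f$ obeys the world model's state structure exactly when it factors through $\phi$:

```latex
f(x) = g(\phi(x)) \quad \text{for some } g \text{ and all inputs } x.
```

A foundation model with an inductive bias toward the world model should, when extrapolating from limited data, prefer functions of this form; the force vector between two planets is one such $g$ applied to the Newtonian state.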
- Orbital Mechanics
We illustrate these ideas by testing whether a transformer trained to predict the locations of planets in motion has recovered Newtonian mechanics. We first train a model to predict the locations of planets across solar systems. Despite the model's ability to accurately predict the future trajectories of planets, the inductive bias probe reveals that it has little inductive bias toward Newtonian mechanics. This is corroborated by the fact that when the model is fine-tuned to predict a planet's force vector, a cornerstone of Newtonian mechanics, its predictions imply a nonsensical law of gravitation. We find that the model has recovered piecemeal heuristics rather than a compact world model; it recovers a different law of gravitation depending on the slice of data it is applied to.
Has the model recovered Newtonian mechanics? The transformer's predictions reflect a very good sequence model, but prediction alone does not settle this question. To answer it, we note that Newtonian mechanics dictates that each observation in a sequence of orbits is governed by a state vector consisting of the masses, relative velocities, and relative positions of the planets. Given the current state of a trajectory, the next position of an orbit is deterministic. This is our world model; if a foundation model's inductive bias reflects Newtonian mechanics, it must be extrapolating based on this state vector.
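As a concrete illustration, the deterministic transition this world model describes can be written as a short integrator. Below is a minimal sketch for a two-body system in two dimensions; the time step and the semi-implicit Euler scheme are our own simplifying choices, not the paper's simulator.

```python
import numpy as np

G = 6.674e-11  # gravitational constant

def step(masses, pos, vel, dt=60.0):
    """One deterministic update of the Newtonian state (two bodies, 2D).

    Semi-implicit Euler: update velocities from the current forces,
    then positions from the new velocities.
    """
    r = pos[1] - pos[0]                          # displacement between bodies
    dist = np.linalg.norm(r)
    f = G * masses[0] * masses[1] * r / dist**3  # force on body 0, toward body 1
    acc = np.stack([f / masses[0], -f / masses[1]])  # Newton's third law
    vel = vel + dt * acc
    pos = pos + dt * vel
    return pos, vel

# Usage with illustrative values (roughly Earth and Moon):
masses = np.array([5.97e24, 7.35e22])
pos = np.array([[0.0, 0.0], [3.84e8, 0.0]])
vel = np.array([[0.0, 0.0], [0.0, 1.02e3]])
pos, vel = step(masses, pos, vel)
```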
We use the inductive bias probe described in Section 2 to assess the model's inductive biases. We create 100 synthetic datasets in which the outputs are linear functions of the state of the sequence. We then fine-tune the transformer to predict these functions. We measure the model's extrapolative predictability across inputs (Equation 3) by taking H to consist of the identity function and the loss ℓ to be mean squared error. We evaluate Equation 6 by comparing the model to an oracle that extrapolates based on the state directly (we consider both linear models and 2-layer neural networks for the oracle, finding similar results). The model's inductive bias toward simple functions of Newtonian state is weak; see Figure 4 for a visualization of the fixed-interval model. In other words, the model's inductive bias is not toward Newtonian state: when it has to extrapolate, it makes similar predictions for orbits with very different states and different predictions for orbits with very similar states. For implementation details, see Appendix B.1.
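For intuition, the probe's comparison can be sketched in a few lines. The sketch below replaces full fine-tuning with a least-squares head on a frozen (here, random) representation so that it runs standalone; `embed`, the dimensions, and the ridge head are our simplifications, not the implementation in Appendix B.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins so the sketch runs standalone: `embed` plays the
# role of the frozen foundation model's representation of each trajectory,
# and `states` are the ground-truth Newtonian states. Both are random here;
# the paper fine-tunes the full transformer.
n_train, n_test, d_state, d_embed = 200, 200, 6, 32
states = rng.normal(size=(n_train + n_test, d_state))
embed = rng.normal(size=(n_train + n_test, d_embed))

def ridge_fit_predict(X_tr, y_tr, X_te, lam=1e-3):
    """Fit a least-squares head and predict: the simplest form of adaptation."""
    w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(X_tr.shape[1]),
                        X_tr.T @ y_tr)
    return X_te @ w

probe_mse, oracle_mse = [], []
for _ in range(100):  # 100 synthetic datasets, as in the text
    w_true = rng.normal(size=d_state)
    y = states @ w_true                 # labels: linear functions of state
    y_tr, y_te = y[:n_train], y[n_train:]

    # Probe: adapt the foundation model's representation to the new task.
    pred = ridge_fit_predict(embed[:n_train], y_tr, embed[n_train:])
    probe_mse.append(np.mean((pred - y_te) ** 2))

    # Oracle: extrapolate from the true Newtonian state directly.
    pred_oracle = ridge_fit_predict(states[:n_train], y_tr, states[n_train:])
    oracle_mse.append(np.mean((pred_oracle - y_te) ** 2))

print(f"probe MSE: {np.mean(probe_mse):.3f}  oracle MSE: {np.mean(oracle_mse):.3f}")
```

A large gap between the probe's error and the oracle's indicates weak inductive bias toward functions of Newtonian state.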