Foundation Priors
Foundation models, and in particular large language models, can generate highly informative responses, prompting growing interest in using these “synthetic” outputs as data in empirical research and decision-making. This paper introduces the idea of a foundation prior, under which model-generated outputs are interpreted not as real observations but as draws from the prior predictive distribution induced by the foundation prior. Synthetic data therefore reflect both the model’s learned patterns and the user’s subjective priors, expectations, and biases. We model the subjectivity of the generative process by making explicit the dependence of synthetic outputs on the user’s anticipated data distribution, the prompt-engineering process, and the trust placed in the foundation model.
We derive the foundation prior as an exponentially tilted, generalized Bayesian update of the user’s primitive prior, where a trust parameter λ governs the weight assigned to synthetic data. We then show how synthetic data and the associated foundation prior can be incorporated into standard statistical and econometric workflows, and discuss their use in applications such as refining complex models, informing latent constructs, guiding experimental design, and augmenting random-coefficient and partially linear specifications. By treating generative outputs as structured, explicitly subjective priors rather than as empirical observations, the framework offers a principled way to harness foundation models in empirical work while avoiding the conflation of synthetic “facts” with real data.
As LLMs are increasingly used to guide research, policy, and business choices, their epistemic role is defined both by their informativeness and by the subjectivity embedded in their generation. A subtle shift in the meaning of “data” is taking place: knowledge once derived from empirical observation is now supplemented, or in some cases replaced, by information that is co-produced through human–model interaction. In this sense, generative outputs function as a new form of subjective data, blending learned correlations with user-imposed priors and serving as inputs to analysis, training, and belief revision.1
A key concern with such generative data is that its provenance is uncertain: the process that generates the data is not cleanly delineated, which makes its reliability for statistical inference difficult to assess. For one, we often have minimal visibility into the structure of foundation models and limited knowledge of the precise data used to train them. Moreover, as we will show, the prompt design and engineering process injects the user’s own subjective priors, beliefs, and preferences into the generation mechanism. This complex and often opaque generation process raises questions about the nature of the generated data and concerns about treating them as equivalent to objectively collected empirical, or “real,” data.
This paper’s central thesis is that such generative data should primarily be interpreted as samples drawn from a specific type of subjective prior predictive distribution. We term this prior the foundation prior: a generative scheme that synthesizes responses to user queries in a way analogous to drawing from a parameterized distribution, but one that is intractable and subjectively malleable. Our aim is to offer a structured path forward: by treating generated outputs as an explicit form of prior knowledge, one can harness their richness while avoiding the risks of conflating synthetic “facts” with empirically grounded evidence.
We then model prompt engineering as an iterative alignment process in which the user proposes a query to the foundation model, evaluates the resulting synthetic data against the anticipated distribution (using a divergence measure), and refines the prompt until the synthetic data align sufficiently with those priors. The end product is a generative, synthetic dataset that captures both the foundation model’s learned patterns and the user’s subjective filters. We demonstrate how one might incorporate this dataset into a decision framework, tempering it by a “trust” parameter λ that determines how heavily to lean on synthetic information versus the original prior. The result is the foundation prior: a structured distribution reflecting the user’s beliefs, the generative model’s knowledge, and the subjective alignment process required to reconcile them. This construction offers a principled way to employ synthetic model outputs in empirical analysis without conflating them with raw observations or data from the real world.
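One common way to formalize a tempered update of this kind, consistent with the description above but not necessarily the paper’s exact construction, takes π(θ) as the user’s primitive prior and p(D*_s | θ) as a working likelihood for the synthetic data:

```latex
% Generalized (power-likelihood) Bayesian update with trust lambda:
% lambda = 0 recovers the primitive prior; lambda = 1 treats the
% synthetic data as if it were real observations.
\pi_{F}(\theta \mid D_s^{*}) \;\propto\; \pi(\theta)\, p(D_s^{*} \mid \theta)^{\lambda},
\qquad \lambda \in [0, 1].
```

Intermediate values of λ down-weight the synthetic evidence relative to a full Bayesian update, which is the sense in which λ encodes trust.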
Our work differs from the extant literature in that we explicitly acknowledge and model the subjectivity inherent in generated, synthetic data. As such, our approach complements the literature by considering the conditions under which generative models serve not merely as useful predictive tools but as structured priors that can be used for empirical inference. In doing so, we offer an alternative, coherent framework that acknowledges the potential for epistemic circularity and distributional mismatch while offering avenues for numerous use cases and applications.
We, instead, focus on the complexity inherent in the generation of such data and the issue of subjectivity introduced by the user in the generative framework. Our analysis provides a coherent framework that allows the outputs of foundation models to be interpreted as draws conditioned on a complex prior, and offers a guide to appropriate avenues for the use of such data. In what follows, we model the generative process that the user follows, characterize the foundation prior, and then examine the use of this prior and synthetic data for analysis and decision making.
Interpreting these outputs as components of a foundation prior clarifies both their value and their limitations. They can be informative, sometimes highly so, and can guide estimation, experiment design, or model specification. But they should influence inference only through an explicitly parameterized trust weight λ, and never by being treated as if they were drawn from the same process as empirical observations. When framed this way, synthetic data become a powerful source of structured prior information rather than a surrogate for real evidence. The tools developed in this paper (integrating across heterogeneous prompts, tempering the influence of synthetic data through conservative trust, and calibrating their effect using real observations) offer a principled way to control this subjectivity.
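Tempering through conservative trust can be made concrete in the simplest conjugate setting. The sketch below assumes a normal model with known variance; the `foundation_prior` helper and its signature are hypothetical illustrations in the spirit of the text, not the paper’s construction.

```python
def foundation_prior(mu0, tau0_sq, synth, sigma_sq, lam):
    """Conjugate normal-normal sketch of a trust-tempered update.

    Illustrative only: n_s synthetic draws with known variance
    sigma_sq contribute lam * n_s / sigma_sq precision instead of
    n_s / sigma_sq, so lam = 0 recovers the primitive prior
    N(mu0, tau0_sq) and lam = 1 treats the synthetic data as real.
    """
    n_s = len(synth)
    xbar = sum(synth) / n_s
    prec = 1.0 / tau0_sq + lam * n_s / sigma_sq
    mean = (mu0 / tau0_sq + lam * n_s * xbar / sigma_sq) / prec
    return mean, 1.0 / prec

# Conservative trust keeps the synthetic data from dominating:
m0, v0 = foundation_prior(0.0, 1.0, [2.0] * 100, 1.0, lam=0.0)  # prior
m1, v1 = foundation_prior(0.0, 1.0, [2.0] * 100, 1.0, lam=0.1)
m2, v2 = foundation_prior(0.0, 1.0, [2.0] * 100, 1.0, lam=1.0)
```

As λ grows, the implied prior mean moves toward the synthetic sample mean and the implied variance shrinks, which is exactly the lever a conservative analyst would hold down.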
The broader implication is that foundation models do not eliminate the need for empirical data; they heighten it. Real data serve as the anchor that disciplines the subjective structure embedded in the foundation prior. Without such anchoring, the iterative prompt process risks reinforcing the user’s prior expectations, producing a form of epistemic circularity. With anchoring, however, foundation priors can serve as an efficient and transparent way to inject domain knowledge, structure high-dimensional spaces, or help navigate problems where real data are scarce.
2.4.1 Interpreting Generative Data

The prompt engineering process described above injects the user’s assumptions, priors, and subjective beliefs into the generated data via a rejection process. The terminal prompt encapsulates this subjectivity and is used to generate acceptable data; we denote this prompt as q* and the corresponding data it generates as D*_s. Based on our description above, it should be clear that the generated data cannot be thought of as the same as “real” data. A number of points are worth emphasizing to make the distinction between real and generative data clear. First, D*_s is based on a prompt q* that is a complex function of the user’s prior and the anticipation process. The prompt engineering process enforces a form of selection: since the user seeks to match the synthesized data to anticipated data, they in effect reject data that do not match their prior beliefs. As such, D*_s is not simply drawn from the foundation model but is convolved with the user’s anticipation in a non-trivial manner. Moreover, since the anticipation process depends on prior beliefs about θ, there is a serious issue of epistemic circularity. Thus D*_s, while potentially informative, is also contaminated by the user’s subjective beliefs.
As we have shown, the prompt engineering process is a function of a number of subjective choices made by the user. We have already noted that it is influenced by the anticipation process and, through it, by π(θ). Further, choices pertaining to the learning rate (η), the discrepancy measure (κ), the stopping rule, and other elements of the prompt engineering process all play a role in the construction of q* and consequently of D*_s. While our description of the generative process is stylized, it reiterates that the data emerging from language models are not objective.
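The stylized process just described (anticipation, a discrepancy measure κ, a learning rate η, and a stopping rule) can be sketched in code. Everything below is hypothetical: the foundation model is mocked as a Gaussian generator indexed by a scalar “prompt parameter,” and the discrepancy is a toy mean-distance rather than a formal divergence.

```python
import random

def prompt_engineering_loop(anticipated_mean, eta=0.5, tol=0.05,
                            max_iters=50, n=500, seed=0):
    """Stylized sketch of the iterative prompt-alignment process.

    Hypothetical names throughout; real prompts are text, so the
    scalar prompt_param is an illustration, not the paper's setup.
    """
    rng = random.Random(seed)

    def generate(prompt_param):
        # Hypothetical stand-in for querying the foundation model
        # with the current prompt.
        return [prompt_param + rng.gauss(0.0, 1.0) for _ in range(n)]

    def kappa(data):
        # Toy discrepancy measure: distance between the synthetic
        # sample mean and the user's anticipated mean.
        return abs(sum(data) / len(data) - anticipated_mean)

    prompt_param = 0.0
    for t in range(max_iters):
        synthetic = generate(prompt_param)
        if kappa(synthetic) < tol:        # stopping rule: aligned
            return prompt_param, synthetic, t
        # Refine the prompt with learning rate eta, nudging the
        # generator toward the anticipated distribution.
        prompt_param += eta * (anticipated_mean - sum(synthetic) / n)
    return prompt_param, synthetic, max_iters

q_star, d_star, iters = prompt_engineering_loop(anticipated_mean=2.0)
```

The loop makes the selection effect visible: data are accepted only once they fall within tolerance of the user’s anticipation, so the terminal dataset is conditioned on the user’s priors by construction.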
Moreover, the foundation model’s internal generative process is based on learned representations and statistical approximations from its training corpus. In contrast, real data are generated by humans via natural, more complex underlying processes: real data are often subject to unpredictable environmental influences, measurement errors, and other stochastic factors. Synthetic data, by contrast, are generated in a controlled computational environment where such sources of randomness or error are either absent or not accurately modeled. This absence can make synthetic data less representative of the complexities inherent in real data. A related issue is that of scale. Data generated from foundation models are conceptually infinite, and since the costs of generation are virtually zero, there is no particular reason to limit the amount of generative data used for analysis. Taken at face value, confidence measures and standard errors become meaningless: we would be forced to accept any inference for θ based on D_s as exact. As a consequence, the need for real data would be obviated, which seems unreasonable. Even so, there is increasing evidence that the data generated by foundation models are informative, and we need to incorporate the information contained in them into our analysis.
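The scale point follows from elementary arithmetic: under i.i.d. sampling, the naive standard error of a sample mean shrinks as 1/√n, so an effectively unlimited synthetic sample drives it toward zero regardless of the data’s subjective provenance (a toy illustration, not a result from the paper).

```python
import math

def se_of_mean(sigma, n):
    """Naive standard error of a sample mean under i.i.d. sampling."""
    return sigma / math.sqrt(n)

# With essentially costless generation, n can be made arbitrarily
# large, and the naive standard error implies spurious certainty:
for n in (10**2, 10**4, 10**8):
    print(n, se_of_mean(1.0, n))
```

This is why uncertainty about θ must enter through the trust weight rather than through sample-size-driven standard errors computed on synthetic draws.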