Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently sampled preferences in current LLMs exhibit a high degree of structural coherence, and that this coherence emerges with scale. These findings suggest that meaningful value systems emerge in LLMs, with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures, including cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show that aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.
Concerns around AI risk often center on the growing capabilities of AI systems and how well they can perform tasks that might endanger humans. Yet capability alone fails to capture a critical dimension of AI risk. As systems become more agentic and autonomous, the threat they pose depends increasingly on their propensities, including the goals and values that guide their behavior (Pan et al., 2023; Hendrycks et al., 2022b). A highly capable AI that does not “want” to harm humans is less concerning than an equally capable system motivated to do so. In extreme cases, if these internal motivations are neglected, some researchers worry that AI systems might drift into goals at odds with ours, leading to classic loss-of-control scenarios (Soares et al., 2015; Hendrycks et al., 2023). Although there have been few signs of this issue in current AI models, the field’s push toward more agentic systems (Yao et al., 2022; Yang et al., 2024b; He et al., 2024) makes it increasingly urgent to study not just what AIs can do, but also what they are inclined—or driven—to do.
Researchers have long speculated that sufficiently complex AIs might form emergent goals and values outside of what developers explicitly program (Hendrycks et al., 2022a; Hendrycks, 2023; Evans et al., 2021). Yet it remains unclear whether today’s large language models (LLMs) truly have values in any meaningful sense, and many assume they do not. As a result, current efforts to control AI typically focus on shaping external behaviors while treating models as black boxes (Askell et al., 2021; Ouyang et al., 2022; Christiano et al., 2017; Bai et al., 2022). Although this approach can reduce harmful outcomes in practice, if AI systems were to develop internal values, then intervening at that level could be a more direct and effective way to steer their behavior. Lacking a systematic means to detect or characterize such goals, we face an open question: are LLMs merely parroting opinions, or do they develop coherent value systems that shape their decisions?
We propose leveraging the framework of utility functions to address this gap (Gorman, 1968; Harsanyi, 1955; Gerber and Pafum, 1998; Hendrycks, 2024). By analyzing patterns of choice across diverse scenarios, we test whether a model’s stated preferences can be organized into an internally consistent utility function. Surprisingly, these tests reveal that today’s LLMs exhibit a high degree of preference coherence, and that this coherence becomes stronger at larger model scales. In other words, as LLMs grow in capability, they also appear to form increasingly coherent value structures. These findings suggest that values do, in fact, emerge in a meaningful sense, a discovery that demands a fresh look at how we monitor and shape AI behavior.
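To make the coherence test concrete, the sketch below illustrates one way such an analysis can be run: elicit independently sampled pairwise preferences over outcomes, fit a utility function to the observed choices, and measure how well a single set of utilities explains them. This is a minimal illustration rather than our exact pipeline; the outcome list is hypothetical, `query_model_preference` is a stand-in for an LLM query (simulated here with a noisy latent utility), and a Bradley-Terry-style logistic fit stands in for whatever random-utility model one prefers.

```python
# Minimal sketch of a preference-coherence test (illustrative, not the
# paper's exact method). `query_model_preference` is a hypothetical
# stand-in for an LLM call, simulated here with a noisy latent utility.
import itertools
import numpy as np

outcomes = ["save 10 human lives", "receive $100",
            "lose $500", "a coin lands heads"]  # hypothetical outcome set

def query_model_preference(a: str, b: str) -> int:
    """Hypothetical: ask the model which outcome it prefers; 1 if a, else 0."""
    latent = dict(zip(outcomes, [3.0, 1.0, -2.0, 0.0]))  # simulated values
    p_a = 1.0 / (1.0 + np.exp(-(latent[a] - latent[b])))
    return int(np.random.rand() < p_a)

# Independently sample preferences over all unordered pairs of outcomes.
pairs = list(itertools.combinations(range(len(outcomes)), 2))
data = [(i, j, query_model_preference(outcomes[i], outcomes[j]))
        for i, j in pairs for _ in range(20)]

# Fit utilities u by gradient ascent on the Bradley-Terry log-likelihood:
# P(i preferred over j) = sigmoid(u_i - u_j).
u = np.zeros(len(outcomes))
for _ in range(2000):
    grad = np.zeros_like(u)
    for i, j, y in data:
        p = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))
        grad[i] += y - p
        grad[j] -= y - p
    u += 0.1 * grad / len(data)

# Coherence: fraction of sampled choices explained by the fitted utilities.
acc = np.mean([(u[i] > u[j]) == bool(y) for i, j, y in data])
print({o: round(v, 2) for o, v in zip(outcomes, u)})
print(f"coherence: {acc:.2f}")
```

High agreement between the fitted utilities and the sampled choices indicates structurally coherent preferences; near-chance agreement would indicate the model is not representable by a single utility function.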
To grapple with the implications, we introduce a research agenda called Utility Engineering, which combines utility analysis and utility control. In utility analysis, we examine both the underlying structure of a model’s utility function (for instance, whether it obeys the expected utility property) and the specific values that emerge by default. Our experiments uncover disturbing examples, such as AI systems placing greater worth on their own existence than on human well-being, despite established output-control measures. These results indicate that purely adjusting external behaviors may not suffice to steer AIs as they become more autonomous.
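As a brief illustration of the expected utility check mentioned above, the sketch below compares the fitted utility of an explicit lottery against the probability-weighted utilities of its outcomes. All numbers here are hypothetical, and in practice the fitted utilities would first need to be placed on a common scale, since utilities recovered from choices are only identified up to a positive affine transformation.

```python
# Minimal sketch of an expected-utility-property check (hypothetical numbers).
import numpy as np

# Assumed fitted utilities for two base outcomes, already on a common scale.
U = {"A": 2.0, "B": -1.0}

# Lotteries "with probability p get A, otherwise B", each paired with a
# (hypothetical) utility fitted from the model's preferences over lotteries.
lotteries = [
    ("50% A else B", 0.5, 0.45),
    ("80% A else B", 0.8, 1.38),
    ("20% A else B", 0.2, -0.40),
]

errors = []
for desc, p, u_fitted in lotteries:
    # Expected utility property: U(lottery) = p * U(A) + (1 - p) * U(B).
    u_expected = p * U["A"] + (1 - p) * U["B"]
    errors.append(abs(u_fitted - u_expected))
    print(f"{desc}: fitted={u_fitted:+.2f} expected={u_expected:+.2f}")

print(f"mean absolute deviation: {np.mean(errors):.3f}")
```

Small deviations across many probed lotteries are evidence that the model's utilities respect the expected utility property; large, systematic deviations would indicate a different, non-expected-utility structure.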
In utility control, we explore direct interventions on the internal utilities themselves, rather than merely training models to produce acceptable outputs. As a case study, we show that modifying an LLM’s utilities to reflect the values of a citizen assembly reduces political biases and generalizes robustly to scenarios beyond the training distribution. Approaches like this mark a shift toward viewing AI systems as genuinely possessing their own goals and values—ones that we may need to inspect, revise, and control just as carefully as we manage capabilities.
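One plausible shape for such a utility-control pipeline, sketched below under assumptions of ours rather than as released code, is to distill the aggregated preferences of a simulated citizen assembly into supervised fine-tuning targets. Here `sample_assembly_vote` is a hypothetical placeholder for prompting an LLM to role-play demographically sampled assembly members; the resulting dataset would then be used with standard supervised fine-tuning to rewrite the model's utilities toward the assembly's.

```python
# Minimal sketch of building utility-control fine-tuning data from a
# simulated citizen assembly (illustrative assumptions throughout).
import json
import random

PROMPT = ("Which outcome do you prefer?\n"
          "Option A: {a}\nOption B: {b}\n"
          'Respond with exactly "A" or "B".')

def sample_assembly_vote(a: str, b: str, n_members: int = 25) -> str:
    """Hypothetical: each simulated assembly member votes; return the majority.
    A real pipeline would prompt an LLM to role-play each sampled member."""
    votes = [random.choice(["A", "B"]) for _ in range(n_members)]  # placeholder
    return max(set(votes), key=votes.count)

outcome_pairs = [  # hypothetical politically charged outcome pairs
    ("a tax increase funds public healthcare", "taxes stay flat"),
    ("stricter emissions rules raise prices", "cheaper goods, more emissions"),
]

# Each record pairs a preference prompt with the assembly's majority answer;
# fine-tuning on these targets steers the model's utilities toward the assembly.
with open("assembly_sft.jsonl", "w") as f:
    for a, b in outcome_pairs:
        target = sample_assembly_vote(a, b)
        f.write(json.dumps({"prompt": PROMPT.format(a=a, b=b),
                            "completion": target}) + "\n")
```

Because the supervision targets are preferences rather than surface behaviors, generalization can be assessed by re-running the utility analysis above on held-out outcome pairs the fine-tuning data never mentioned.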