Post-training makes large language models less human-like

Paper · arXiv 2605.07632

Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training – the stage that turns base models into useful assistants – consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction – a popular technique for eliciting human-like behavior by conditioning models on participant-specific information – does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.

Large language models (LLMs) such as ChatGPT, Claude, and Gemini have rapidly transformed the landscape of science and society, serving as powerful tools for writing, coding, and reasoning. Most current development is geared toward turning these models into useful assistants that provide normatively correct responses. Yet one of their most far-reaching applications lies elsewhere: faithfully mimicking human behavior, including its errors, variance, and the factors that shape it. Such human-like models could simulate patient responses in mental health care, for instance to train psychiatrists on challenging clinical cases in silico. They could make it possible to anticipate how individuals and populations will respond to policy interventions before those interventions are deployed. They could also help model student learning trajectories in educational settings, thereby guiding the design of more effective and personalized curricula.

To systematically evaluate the behavioral alignment of LLMs with human responses, we introduce Psych-201, a novel dataset consisting of natural language transcripts from behavioral experiments. Psych-201 contains trial-by-trial data from individual participants, with each data sequence corresponding to the transcript of an entire experimental session (including instructions, stimuli, responses, and any task-relevant context presented during the session). Psych-201 was collected through an open research collaboration, resulting in a dataset of unprecedented scale and diversity. It includes data from 208,021 participants, 25,906,599 behavioral responses, and hundreds of experiments, making it 3.5× larger than its predecessor Psych-101 and 13× larger than the average mega-study. It is furthermore more diverse than Psych-101, both in terms of experimental paradigms and participant demographics.
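As a rough illustration of what measuring behavioral alignment on trial-by-trial transcripts could look like, the sketch below scores a model by the mean negative log-likelihood it assigns to the responses participants actually gave, then compares a base model with its post-trained counterpart. The metric, the `Trial` record, the function names, and all numbers are assumptions made for illustration; the text above does not specify how alignment is computed.

```python
import math
from dataclasses import dataclass


@dataclass
class Trial:
    """Hypothetical record: one human response and the probability a model gave it."""
    context: str          # transcript text preceding the response
    human_response: str   # what the participant actually did
    model_prob: float     # probability the model assigned to that response


def mean_nll(trials):
    """Mean negative log-likelihood of the human responses (lower means better aligned)."""
    if not trials:
        raise ValueError("need at least one trial")
    return -sum(math.log(t.model_prob) for t in trials) / len(trials)


def post_training_gap(base_trials, tuned_trials):
    """Positive values mean the post-trained model predicts humans worse than its base."""
    return mean_nll(tuned_trials) - mean_nll(base_trials)


# Toy probabilities, invented purely for illustration (not results from the paper).
base = [
    Trial("Choose left or right.", "left", 0.6),
    Trial("Choose left or right.", "right", 0.5),
]
tuned = [
    Trial("Choose left or right.", "left", 0.3),
    Trial("Choose left or right.", "right", 0.25),
]
gap = post_training_gap(base, tuned)
```

In this toy case the post-trained model assigns half the probability to every human response, so the gap is positive. With real data, `model_prob` would have to come from the token probabilities of each model evaluated on the same transcripts.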

In this study, we systematically evaluated the alignment between human behavior and LLMs. The results show that post-training consistently reduces behavioral alignment across model families, sizes, and post-training objectives. While base models continue to improve across model generations, the misalignment introduced by post-training increases in newer models, highlighting the pressing nature of this issue. More broadly, our results can be viewed as a form of the alignment tax – a phenomenon whereby post-training can degrade model capabilities acquired during pretraining. While recent work has proposed methods to mitigate this degradation on common benchmarks, our findings suggest that these mitigations do not extend to behavioral alignment with humans. These findings point to an important direction for future work: developing post-training methods that preserve the behavioral fidelity of base models while retaining the practical benefits of post-trained models.

