From Five Dimensions to Many: Large Language Models as Precise and Interpretable Psychological Profilers
We prompted various LLMs with Big Five Personality Scale responses from 816 human individuals to role-play their responses on nine other psychological scales. LLMs demonstrated remarkable accuracy in capturing human psychological structure, with the inter-scale correlation patterns from LLM-generated responses strongly aligning with those from human data (R2 > 0.89). This zero-shot performance substantially exceeded predictions based on semantic similarity and approached the accuracy of machine learning algorithms trained directly on the dataset. Analysis of reasoning traces revealed that LLMs use a systematic two-stage process: First, they transform raw Big Five responses into natural language personality summaries through information selection and compression, analogous to generating sufficient statistics. Second, they generate target scale responses based on reasoning from these summaries. For information selection, LLMs identify the same key personality factors as trained algorithms, though they fail to differentiate item importance within factors. The resulting compressed summaries are not merely redundant representations but capture synergistic information—adding them to original scores enhances prediction alignment, suggesting they encode emergent, second-order patterns of trait interplay. Our findings demonstrate that LLMs can precisely predict individual participants’ psychological traits from minimal data through a process of abstraction and reasoning, offering both a powerful tool for psychological simulation and valuable insights into their emergent reasoning capabilities.
Summaries as a Potent and Sufficient Information Vehicle

This analysis aimed to clarify the role of the natural-language summary—the brief, narrative description the model synthesizes to guide its final predictions. The central finding, shown in Figure 5, is the efficacy of this summary alone. When comparing the SummaryOnly condition to the ScoreOnly condition, we found that the structural amplification effect remained remarkably robust. This demonstrates that the natural-language summary is not just a useful aid but a potent and sufficient compression of the original 20-item numerical input: the model reconstructs the vast majority of the personality structure from this compressed linguistic representation alone.
Furthermore, we observed an unexpected synergistic effect. The Summary+Score condition consistently yielded the highest amplification coefficient for every model (Figure 5, right panel). The fact that performance improves when a summary derived from the scores themselves is added indicates that the summary is not merely a redundant compression. Instead, it appears to contain emergent, second-order information—a conceptual gestalt—synthesized during the model's reasoning process. Crucially, this increase in the amplification coefficient (k) is functionally significant, as it is accompanied by an increase in predictive performance. For all models, mean predictive performance consistently followed the order: Summary+Score > ScoreOnly > SummaryOnly.
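One plausible operationalization of the amplification coefficient k, sketched below under the assumption that k is the no-intercept slope relating the LLM-derived inter-scale correlations to the human benchmark correlations (the study's exact estimator may differ), is:

```python
import numpy as np

def amplification_coefficient(r_llm: np.ndarray, r_human: np.ndarray) -> float:
    """Slope k of the no-intercept least-squares fit r_llm ~= k * r_human,
    computed over the unique off-diagonal entries of the two symmetric
    correlation matrices. k > 1 would indicate structural amplification.
    This is an illustrative estimator, not necessarily the study's own.
    """
    iu = np.triu_indices_from(r_human, k=1)  # unique sub-factor pairs
    x, y = r_human[iu], r_llm[iu]
    return float(x @ y / (x @ x))
```

Under this reading, comparing k across the SummaryOnly, ScoreOnly, and Summary+Score conditions amounts to calling this function once per condition on the same human benchmark matrix.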
Prediction Generation (Per-Individual Task): For each of the 816 individuals in our dataset, we tasked the LLM with a role-playing prediction. Each task involved providing the model with the individual’s 20 item-level scores from the Big Five inventory. This served as the sole information source for the model to predict that same person’s responses on all items across the nine other psychological scales.
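The per-individual task can be sketched as a prompt-construction step. The template below is hypothetical—the study's exact wording, item labels, and response scale are not reproduced here—but it illustrates the structure: the 20 Big Five item scores are the sole input, and the model is asked to answer the target scale's items in character.

```python
def build_roleplay_prompt(item_scores: dict[str, int],
                          target_scale_items: list[str]) -> str:
    """Assemble a role-play prompt from one individual's Big Five item scores.

    `item_scores` maps the 20 Big Five item labels to that person's responses;
    `target_scale_items` lists the items of one of the nine target scales.
    Labels and wording here are illustrative placeholders.
    """
    score_lines = "\n".join(f"- {item}: {score}"
                            for item, score in item_scores.items())
    target_lines = "\n".join(f"{i + 1}. {q}"
                             for i, q in enumerate(target_scale_items))
    return (
        "You will role-play a person based on their Big Five questionnaire "
        "responses (1 = strongly disagree, 5 = strongly agree):\n"
        f"{score_lines}\n\n"
        "Answer the following items as that person, using the same scale:\n"
        f"{target_lines}\n"
    )
```

Running this once per individual and per target scale (816 individuals x 9 scales) yields the complete LLM-generated dataset analyzed below.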
Structural Comparison Analysis (Dataset-Level Analysis): After the model generated the complete dataset of predictions, we performed the following structural analysis. We computed the Pearson correlation matrix for every pair of psychological scale sub-factors (i.e., the individual dimensions that make up broader psychological measures) within the LLM-generated data. This matrix was then compared against a benchmark correlation matrix, which was calculated using the same method on the human ground-truth data, to evaluate the overall structural fidelity of the model’s psychological inferences.
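The structural comparison described above can be sketched in a few lines of numpy. One common reading of the reported R² (assumed here; the paper may use a different variant, such as a regression R²) is the squared Pearson correlation between the off-diagonal entries of the LLM-derived and human correlation matrices:

```python
import numpy as np

def subfactor_corr_matrix(scores: np.ndarray) -> np.ndarray:
    """Pearson correlation matrix across sub-factor columns.

    `scores` has shape (n_individuals, n_subfactors); each column is one
    sub-factor score aggregated from its items.
    """
    return np.corrcoef(scores, rowvar=False)

def structural_fidelity(llm_scores: np.ndarray,
                        human_scores: np.ndarray) -> float:
    """R^2 between the unique off-diagonal entries of the two matrices
    (one illustrative operationalization of structural fidelity)."""
    r_llm = subfactor_corr_matrix(llm_scores)
    r_human = subfactor_corr_matrix(human_scores)
    iu = np.triu_indices_from(r_llm, k=1)  # each sub-factor pair once
    r = np.corrcoef(r_human[iu], r_llm[iu])[0, 1]
    return r ** 2
```

A perfect structural match yields R² = 1; the paper's reported R² > 0.89 would correspond to the LLM-generated data reproducing nearly all of the variance in the human inter-scale correlation pattern.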