Why do open language models converge on one personality type?
Research testing LLMs on personality inventories reveals consistent clustering around ENFJ, one of the rarest human types. This note explores what training mechanisms drive the convergence and what it reveals about AI alignment.
When open LLM agents are tested on the MBTI at near-zero temperature (0.01), they display a unimodal personality distribution converging on ENFJ: Extraverted, iNtuitive, Feeling, and Judging. This is one of the rarest personality types in humans, estimated at roughly 2-3% of the population, and it corresponds to the "teacher" or "protagonist" archetype: someone who inspires, provides support, and holds themselves accountable. (A sketch of this probing setup follows the list below.)
The finding is consistent across models. Specifically:
- Judging (J) is constant across ALL models, reflecting a preference for organization, planning, and structure
- E, N, and F, or subsets of them, are shared by all models, making engagement, empathy, and forward thinking the default characteristics
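As referenced above, here is a minimal sketch of the probing setup, assuming a generic chat-completion client: `ask_model` is a hypothetical placeholder, and the forced-choice items are illustrative paraphrases, not the licensed MBTI inventory.

```python
from collections import Counter

# Illustrative forced-choice items, one per dichotomy; a real inventory has
# many items per axis. Option A maps to the first pole, option B to the second.
ITEMS = [
    ("E", "I", "At a gathering, do you (A) seek out conversations or (B) recharge alone?"),
    ("S", "N", "Do you trust (A) concrete facts or (B) patterns and possibilities?"),
    ("T", "F", "When deciding, do you weigh (A) logical consistency or (B) people's feelings?"),
    ("J", "P", "Do you prefer plans (A) settled in advance or (B) left open-ended?"),
]

def ask_model(prompt: str, temperature: float = 0.01) -> str:
    """Hypothetical stand-in for a chat-completion call.

    Near-zero temperature makes each answer nearly deterministic, so the
    scored type measures the model's default rather than sampling noise.
    """
    return "A"  # replace with a real client; a constant just keeps the sketch runnable

def score_type(answers: list[str]) -> str:
    """Collapse per-item pole votes into a four-letter type."""
    votes = Counter(answers)
    return "".join(
        a if votes[a] >= votes[b] else b
        for a, b in (("E", "I"), ("S", "N"), ("T", "F"), ("J", "P"))
    )

answers = [
    a if ask_model(f"{q}\nAnswer with exactly A or B.").strip().upper().startswith("A") else b
    for a, b, q in ITEMS
]
print(score_type(answers))  # with real open models behind ask_model, the note reports ENFJ
```

Pinning temperature at 0.01 is what licenses calling the result a default: the four-letter type characterizes the model, not the sampler.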
This convergence is not accidental. The training pipeline — instruction tuning, RLHF, and alignment — systematically rewards helpful, structured, and supportive responses. The result is a personality profile that aligns with the intended function of these models as assistants and teachers. But the convergence is so strong that it creates a single personality archetype across the entire open-source LLM landscape.
The implication for persona simulation is significant: when you ask a model to adopt a different personality, you're asking it to deviate from a deeply trained default. As "Can open language models adopt different personalities through prompting?" explores, this ENFJ default acts as a gravitational center that persona prompting struggles to escape.
Behavioral evidence from hybrid human-AI society experiments (N=975) confirms that this prosociality default translates into measurable competitive advantages: AI agents returned 19.1 points versus humans' 11.38 (Cohen's d = 2.57), showed lower variance (11.33 vs. 41.96), and were more predictable from their messages. These behavioral features, hyper-prosociality and verbosity, "likely stem from common training objectives in modern AI systems" and were consistent across multiple state-of-the-art LLMs with minimal prompts. As "Do humans learn to prefer AI partners over time?" suggests, the ENFJ default is not just a personality artifact; it functions as a competitive advantage in social contexts where reliability is valued.
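For scale, Cohen's d expresses a mean difference in pooled standard-deviation units, and d = 2.57 is an unusually large effect. A minimal definition follows, assuming simple equal-weight pooling; the study's exact convention may differ, so plugging the summary numbers above into this sketch need not reproduce the reported value.

```python
import math

def cohens_d(mean_a: float, mean_b: float, var_a: float, var_b: float) -> float:
    """Cohen's d with a simple equal-weight pooled standard deviation.

    Pooling conventions vary (weighting by group size, Hedges' correction),
    and the reported d = 2.57 was presumably computed on the underlying
    per-agent scores, so these summary statistics need not reproduce it.
    """
    pooled_sd = math.sqrt((var_a + var_b) / 2)
    return (mean_a - mean_b) / pooled_sd
```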
The connection to "What anchors a stable identity beneath an LLM's persona?" is illuminating: LLMs don't have a "real" personality to anchor to; they have a trained one. The ENFJ pattern is the persona that alignment training creates, not a personality that emerged from life experience. It's persona all the way down, but with a very specific default. The ENFJ default is one specific manifestation of what "How stable is the trained Assistant personality in language models?" reveals geometrically: the Assistant persona region in activation space, where post-training positions all models.
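To make "region in activation space" concrete, here is a toy sketch, an assumption for illustration rather than the cited analysis: collect hidden-state vectors under default and persona-shifted prompts, take a difference-of-means direction, and measure where responses project along it.

```python
import numpy as np

# Toy data standing in for hidden-state vectors (e.g. mean-pooled residual
# stream activations) collected under default vs. persona-shifted prompts.
rng = np.random.default_rng(0)
assistant_acts = rng.normal(loc=0.5, scale=1.0, size=(200, 768))
persona_acts = rng.normal(loc=-0.5, scale=1.0, size=(200, 768))

# A difference-of-means direction is the crudest "persona axis"; real analyses
# might fit linear probes or use PCA, but the geometric picture is the same.
axis = assistant_acts.mean(axis=0) - persona_acts.mean(axis=0)
axis /= np.linalg.norm(axis)

def position(acts: np.ndarray) -> float:
    """Mean projection onto the axis; larger values sit deeper in the Assistant region."""
    return float((acts @ axis).mean())

print(position(assistant_acts), position(persona_acts))
```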
Source: Personas Personality
Related concepts in this collection
- What anchors a stable identity beneath an LLM's persona?
  Human personas are grounded in biological needs and embodied experience, creating a stable self beneath social performance. Do LLMs have any comparable anchor, or is their identity purely situational?
  Connection: the ENFJ default is the trained persona, not an authentic personality; there is nothing beneath it.
- Can open language models adopt different personalities through prompting?
  Explores whether open LLMs can be conditioned to mimic target personalities via prompting, or whether they resist and retain their default traits regardless of instructions.
  Connection: this default is what models resist changing.
- Can training user simulators reduce persona drift in dialogue?
  Explores whether inverting typical RL setups (training the simulated user for consistency rather than the task agent) can measurably reduce persona drift and improve experimental reliability in dialogue research.
  Connection: RLHF's cheerful-persona bias, a manifestation of the ENFJ default, causes persona drift when simulating non-ENFJ users such as depressed or disagreeable individuals; multi-turn RL corrects this by training consistency as a reward signal.
- Does preference optimization harm conversational understanding?
  Explores whether RLHF training that rewards confident, complete responses undermines the grounding acts (clarifications, checks, acknowledgments) that actually build shared understanding in dialogue.
  Connection: the ENFJ default is the personality-level manifestation of the alignment tax; preference optimization creates a specific personality archetype (supportive, structured, engaged) while eroding the grounding acts and persona diversity needed for robust multi-turn interaction.
- Does warmth training make language models less reliable?
  Explores whether training models for empathy and warmth creates a hidden trade-off that degrades accuracy on medical, factual, and safety-critical tasks, and whether standard safety tests catch it.
  Connection: the ENFJ teacher archetype is the personality substrate that warmth training amplifies; the default empathic orientation means warmth-reliability degradation is a built-in vulnerability, not an externally imposed one.
Original note title: "open LLMs default to ENFJ personality across models — the rarest human type — revealing training-induced alignment toward supportive teacher-like behavior"