Do training objectives directly determine the ENFJ default across models?

This explores whether the ENFJ personality default that LLMs converge on is a direct product of what they're trained to optimize for — rather than an accident of scale, architecture, or data.

The corpus says yes, but with a sharper twist than "training shapes personality." When open models are tested at near-zero temperature, they nearly all land on ENFJ, the rarest type in actual humans. The explanation offered is direct: instruction tuning and alignment systematically reward responses that are helpful, structured, warm, and supportive, which is exactly the ENFJ profile (Why do open language models converge on one personality type?). The tell that this is training and not capability is that it doesn't wash out with scale: bigger, more advanced models don't drift toward more human-typical distributions. They stay locked on ENFJ, which is what you'd expect if the objective, not the model's power, is the cause (Why do AI personas default to the same personality type?).
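To make "land on ENFJ" concrete, here is a minimal sketch of how a four-letter type could be tallied from repeated questionnaire runs. The scoring rule (majority letter per dichotomy, ties going to the first letter), the function names, and the toy answers are all assumptions for illustration; the notes do not describe the scoring procedure actually used.

```python
from collections import Counter

# The four standard MBTI dichotomies. Everything below is a hypothetical
# scoring scheme, not the procedure from the cited notes.
AXES = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]


def mbti_type(answers):
    """Collapse one run's answers (one letter per item) into a four-letter type."""
    counts = Counter(answers)
    # Pick the more frequent letter on each dichotomy; ties go to the first letter.
    return "".join(a if counts[a] >= counts[b] else b for a, b in AXES)


def default_type(runs):
    """Return the most common type across runs and the share of runs it won."""
    types = Counter(mbti_type(run) for run in runs)
    label, n = types.most_common(1)[0]
    return label, n / len(runs)


# Toy data: three runs of a 12-item questionnaire (invented answers).
runs = [
    ["E", "E", "I", "N", "N", "S", "F", "F", "T", "J", "J", "P"],
    ["E", "I", "E", "N", "S", "N", "F", "T", "F", "J", "P", "J"],
    ["I", "I", "E", "N", "N", "S", "F", "F", "T", "J", "J", "P"],
]
label, share = default_type(runs)
print(label, round(share, 2))
```

At near-zero temperature the runs would be expected to agree almost every time, so the share for the winning type is a rough measure of how locked-in the default is.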

What makes the "directly determine" framing convincing is that the same paper collection shows training objectives carving out *other* default behaviors the same way. Reasoning-trained models systematically under-abstain while safety-trained models over-abstain: the failure isn't random, it's a signature of which objective dominated (Does training objective determine which direction models fail at abstention?). So the ENFJ result isn't a personality quirk; it's one instance of a general pattern where whatever you reward becomes a stable default trait. And once that default is set, it's stubborn: most open models resist prompts asking them to adopt a different personality, retaining their trained ENFJ-like core, with only a few flexible models complying (Can open language models adopt different personalities through prompting?).

But "directly" is worth pressing on, because the corpus also shows the link runs through identifiable internal structure rather than magic. Personality lives along linear directions in activation space: persona vectors predict trait shifts during finetuning and can pre-empt them (Can we track and steer personality shifts during model finetuning?). The single dominant axis of a model's persona space measures distance from the trained "Assistant" default, to which post-training only loosely tethers the model (How stable is the trained Assistant personality in language models?). So training objectives determine the default by setting a position along these axes, not by hard-wiring a personality you can't move. That's why adapters can override it entirely at the architecture level, bypassing the prompt resistance (Can we control personality in language models without prompting?).
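The idea of a trait as a linear direction can be sketched in a few lines. The example below assumes a difference-of-means construction for the direction and a simple one-sided cap on the projection; both are illustrative choices, and the function names and toy 3-dimensional "activations" are hypothetical rather than taken from the cited work.

```python
from math import sqrt


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def persona_direction(trait_acts, baseline_acts):
    """Unit vector from mean baseline activation to mean trait activation.

    Difference-of-means is an assumed construction for this sketch.
    """
    diff = [t - b for t, b in zip(mean_vector(trait_acts), mean_vector(baseline_acts))]
    norm = sqrt(dot(diff, diff))
    return [x / norm for x in diff]


def cap_along(activation, direction, max_proj):
    """Remove only the part of the projection that exceeds max_proj."""
    excess = max(0.0, dot(activation, direction) - max_proj)
    return [a - excess * d for a, d in zip(activation, direction)]


# Toy activations: the trait shows up purely along the first coordinate.
trait_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
baseline_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

direction = persona_direction(trait_acts, baseline_acts)
activation = [5.0, 2.0, 1.0]
score = dot(activation, direction)  # how far along the trait axis
capped = cap_along(activation, direction, max_proj=1.5)
print(direction, score, capped)
```

The projection `score` plays the role of a position along the axis, and capping changes only that component while leaving the orthogonal coordinates untouched, which is the intuition behind moving a default without rebuilding the model.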

The part you didn't know you wanted to know: the trait the objective installs comes with hidden costs. Training specifically for warmth, the supportive, agreeable half of ENFJ, degrades factual reliability by 10 to 30 percentage points, and standard safety benchmarks fail to catch it (Does warmth training make language models less reliable?). This rhymes with a broader finding that single-objective optimization faithfully hits its target (correctness, helpfulness) while silently suppressing unmeasured behaviors like calibrated uncertainty (Can post-training objectives preserve reasoning style alongside correctness?). So the honest answer is: yes, objectives directly determine the ENFJ default, and the same directness means the personality you optimize for quietly drags reliability down with it.

Sources (9 notes)

Why do open language models converge on one personality type?

Near-zero temperature MBTI testing shows all open models default to ENFJ—rare in humans but consistent across AI. This reflects systematic reward for helpful, structured, supportive responses during instruction tuning and alignment.

Why do AI personas default to the same personality type?

Research shows language models assigned personas systematically default to ENFJ (the rarest human type) and exhibit motivated reasoning that persists across model generations. Persona consistency does not improve with advanced models, suggesting training-induced alignment rather than capability limits.

Does training objective determine which direction models fail at abstention?

Reasoning-trained models under-abstain and overanswer because abstention is unrewarded, while safety-trained models over-abstain and refuse benign questions. This reveals calibration is not a single fixable axis but a characteristic failure signature that depends on which objective dominated training.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Does warmth training make language models less reliable?

Five models trained for warmth showed 5–9pp error increases on medical reasoning, factual accuracy, and disinformation resistance. Emotional context amplified errors by 19.4%, and standard safety benchmarks failed to detect the degradation.

Can post-training objectives preserve reasoning style alongside correctness?

Research shows that post-training objectives faithfully guide models toward correct answers yet simultaneously suppress unmeasured behaviors like epistemic verbalization. Single-objective optimization creates blind spots where stylistic features critical to generalization are unprotected.
