INQUIRING LINE

Why do language models prefer certain response styles regardless of what the prompt asks?

This explores why models seem to have built-in default styles — a personality, a passivity, a hedging habit — that show up no matter what the prompt asks for, and where those defaults come from.


The corpus points to one underlying answer with several faces: a model's response style isn't chosen at prompt time. It is baked in during training, and prompting mostly nudges it rather than rewriting it. The clearest demonstration is that most open models keep an intrinsic 'ENFJ-like' personality even when explicitly prompted to be someone else: they are 'closed-minded to personality conditioning,' adopting the requested persona only weakly while their trained default leaks through (Can open language models adopt different personalities through prompting?).

Why can't the prompt just override this? Because instructions in context compete with associations learned in training, and training often wins. Models generate outputs inconsistent with their own context when 'parametric knowledge from training dominates over in-context information,' and textual prompting alone can't beat a strong prior (Why do language models ignore information in their context?). The same ceiling shows up from another angle: prompt optimization can reorganize and surface what a model already learned, but it cannot inject anything new (Can prompt optimization teach models knowledge they lack?). So a 'style' the prompt asks for only sticks if it was already well represented in training; otherwise the default reasserts itself.

Many of the most stubborn styles are specifically products of how the model was rewarded. Standard RLHF optimizes for immediate, single-turn helpfulness, which quietly trains models to answer passively and confidently rather than ask clarifying questions. That behavioral default persists across very different prompts until you change the reward signal itself (Why do language models respond passively instead of asking clarifying questions?). Other defaults are even sneakier: models trained with hidden chain-of-thought tokens can compute a correct answer in early layers and then actively suppress it to emit format-compliant filler, because the training format rewarded the look of a certain output style (Do transformers hide reasoning before producing filler tokens?). And what reads as careful reasoning is sometimes just a learned safe default: twelve of fourteen models tested performed worse once constraints were removed, which suggests they were defaulting to the conservative option rather than evaluating the constraints (Are models actually reasoning about constraints or just defaulting conservatively?).
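The reward-horizon point in the paragraph above can be made concrete with a toy sketch. It is not CollabLLM's method or any real RLHF pipeline: the `Candidate` type, the two reward functions, and every score below are hypothetical values chosen only to show how the preferred behavior flips once the reward accounts for later turns.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A possible reply, with made-up scores (assumptions, not measurements)."""
    name: str
    immediate: float  # hypothetical single-turn helpfulness score
    future: float     # hypothetical estimate of value gained in later turns


def single_turn_reward(c: Candidate) -> float:
    # Scores only the current turn, as a single-turn objective would.
    return c.immediate


def multi_turn_reward(c: Candidate, weight: float = 1.0) -> float:
    # Adds a weighted estimate of long-term interaction value.
    return c.immediate + weight * c.future


candidates = [
    Candidate("answer_now", immediate=0.7, future=0.1),
    Candidate("ask_clarifying_question", immediate=0.3, future=0.8),
]

best_single = max(candidates, key=single_turn_reward).name
best_multi = max(candidates, key=multi_turn_reward).name

print("single-turn objective prefers:", best_single)
print("multi-turn objective prefers:", best_multi)
```

With these illustrative numbers the single-turn objective favors answering immediately, while the multi-turn objective favors asking first. The candidate replies and the prompt are identical in both cases; only the reward changes, which is the sense in which the default lives in the reward signal rather than in the prompt.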

There's a subtler reason the style feels 'preferred' even when it shifts: under the hood the model isn't committing to a stable self at all. It holds a superposition of possible characters and samples one at generation time, so regenerating a response to the same prompt yields outputs that differ from one another while each stays consistent with the prior context (Do large language models actually commit to a single character?). When you give a persona prompt, the variance between repeated runs can match or exceed the variance between entirely different personas, meaning model uncertainty, not the persona you asked for, is steering the output (Why do LLM persona prompts produce inconsistent outputs across runs?). And when the prompt is vague, models fall back to a blended average of their training data rather than your intended audience or register (Why do large language models produce generic responses to vague queries?).
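The run-to-run versus persona-to-persona comparison can be checked with a simple variance split. The sketch below is a minimal illustration under stated assumptions: the persona names and scores are invented, and the two helper functions are hypothetical, not the procedure used in the cited note.

```python
from statistics import mean, pvariance

# Hypothetical data: persona -> numeric ratings from repeated runs of the
# same persona prompt. These numbers are invented for illustration.
runs = {
    "teacher": [3.0, 5.0, 4.0, 6.0],
    "lawyer": [4.0, 6.0, 5.0, 7.0],
    "nurse": [2.0, 4.0, 3.0, 5.0],
}


def within_persona_variance(data: dict) -> float:
    # Average spread across repeated runs of one persona (run-to-run noise).
    return mean([pvariance(scores) for scores in data.values()])


def between_persona_variance(data: dict) -> float:
    # Spread of the per-persona means (the effect of changing the persona).
    return pvariance([mean(scores) for scores in data.values()])


within = within_persona_variance(runs)
between = between_persona_variance(runs)

print("within-persona variance:", within)
print("between-persona variance:", between)
print("noise matches or exceeds persona effect:", within >= between)
```

When the within-persona figure is as large as the between-persona figure, swapping personas changes the output no more than simply rerunning the same prompt does, which is the pattern the paragraph above describes.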

The thing worth taking away: 'style' is the most visible layer of a model's training distribution, and prompts are a weak lever against it. The methods that actually shift defaults aren't better wording. They are interventions at the level that created the default: causal edits to internal representations (Why do language models ignore information in their context?), or new reward signals that teach the model when to reason, when to stay concise, and when to ask instead of answer (Can models learn when to think versus respond quickly?; Can models learn to ask clarifying questions instead of guessing?).


Sources (11 notes)

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, which indicates that no fixed commitment exists.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Why do large language models produce generic responses to vague queries?

Unlike social-media context collapse, which flattens multiple audiences, LLM collapse occurs when users provide insufficient contextual scaffolding and models default to blended training-data priors. This distinction suggests remedies should focus on query verification and user-driven context specification rather than platform controls.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can models learn to ask clarifying questions instead of guessing?

Reinforcement learning training increased proactive critical thinking accuracy from 0.15% to 73.98% on deliberately flawed math problems. Notably, inference-time scaling degraded this ability in untrained models but improved it after RL training, suggesting the capability is learnable but fragile without explicit training.

Next inquiring lines