Can interventions on individual features reliably steer language model behavior?

This note explores whether nudging a single internal 'knob' (a feature, direction, or trait dimension in the model's activations) can dependably steer behavior, and where that control breaks down. The corpus gives a split answer: targeted intervention works far better than prompting, but 'reliable' depends heavily on the model and the trait.

The strongest 'yes' comes from work that finds linear directions in activation space. Researchers have isolated 'persona vectors' corresponding to traits like sycophancy and hallucination; these vectors both predict personality drift during finetuning and can preventively steer training away from it (Can we track and steer personality shifts during model finetuning?). A more aggressive version skips prompting entirely: lightweight adapters that modify every transformer layer with under 0.1% extra parameters reach 87.3% Big Five accuracy across GPT-2, Gemma, and Llama 3 (Can we control personality in language models without prompting?). The recurring lesson is that architecture-level or representation-level intervention reaches behavior that text cannot.
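To make "a linear direction in activation space" concrete, here is a minimal sketch that estimates a trait direction as a normalized difference of means and then shifts an activation along it. It is an illustration under stated assumptions, not the method of the cited work: the activations are synthetic Gaussian vectors, and the names `trait_direction`, `project`, and `steer` are hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def trait_direction(with_trait, without_trait):
    """Difference-of-means direction between two activation sets (unit length)."""
    mu_pos = mean_vector(with_trait)
    mu_neg = mean_vector(without_trait)
    return unit([p - q for p, q in zip(mu_pos, mu_neg)])


def project(activation, direction):
    """Scalar readout: how strongly the activation expresses the direction."""
    return sum(a * d for a, d in zip(activation, direction))


def steer(activation, direction, alpha):
    """Add alpha times the direction; a negative alpha suppresses the trait."""
    return [a + alpha * d for a, d in zip(activation, direction)]


# Synthetic stand-in for hidden states (assumption: the trait shifts coordinate 0).
rng = random.Random(0)
DIM = 8


def sample(shift):
    v = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    v[0] += shift
    return v


with_trait = [sample(3.0) for _ in range(200)]
without_trait = [sample(0.0) for _ in range(200)]
direction = trait_direction(with_trait, without_trait)

h = sample(0.0)
before = project(h, direction)
after = project(steer(h, direction, alpha=-2.0), direction)
```

Because the direction has unit length, steering with `alpha=-2.0` lowers the projection by exactly 2. In a real model the activations would come from a chosen layer, and whether a shift of that size changes behavior is an empirical question.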

Why does the intervention have to reach inside the model? Because the surface layer is stubborn. Prompting can only reorganize knowledge the model already has; it cannot inject what training omitted (Can prompt optimization teach models knowledge they lack?). When a model's parametric priors are strong, in-context instructions get overridden, and the research is explicit that causal intervention in representations, not better wording, is what changes the output (Why do language models ignore information in their context?). Personality is similar: most open models resist prompted personas and snap back to a trained default, so only direct manipulation reliably moves them (Can open language models adopt different personalities through prompting?). Alignment training itself bakes in a single static communicative identity that users cannot renegotiate through dialogue (Can language models adapt communication style to different contexts?).

But 'reliably' is where the cracks show. A model is not a fixed character holding one value you can dial. Shanahan's 20-questions test shows that it maintains a superposition of candidate characters and samples a fresh one, consistent with the prior context, at each generation, so any single intervention steers a distribution, not a settled state (Do large language models actually commit to a single character?). Worse, traits do not always live where you would look: behavioral traits transmit between models through semantically unrelated data, carried by statistical signatures rather than clean semantic features, and the effect is model-specific and fails across architectures (Can language models transmit hidden behavioral traits through unrelated data?). That model-specificity is the catch: the same persona vector or adapter that works on one model may not transfer to another, which is exactly what 'unreliable' means in practice.
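The claim that an intervention "steers a distribution, not a settled state" can be illustrated with a toy sampler. This is a sketch under assumptions: the three persona labels, the logit values, and the size of the steering bias are invented for illustration and are not taken from the cited notes.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_counts(probs, n, seed):
    """Draw n samples from the categorical distribution and count each outcome."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    for _ in range(n):
        idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        counts[idx] += 1
    return counts


# Hypothetical personas the model could sample at generation time.
personas = ["agreeable", "neutral", "blunt"]
base_logits = [1.0, 1.0, 1.0]

# Steering modeled as a bias added to one persona's logit (an assumption).
steered_logits = [1.0, 1.0, 1.0 + 2.0]

base_probs = softmax(base_logits)
steered_probs = softmax(steered_logits)
steered_counts = sample_counts(steered_probs, n=1000, seed=0)
```

In this toy setup the steered persona becomes the most likely outcome, but its probability stays below 1, so repeated generations still land on the other personas some of the time. A stronger bias narrows the distribution without making any single sample guaranteed.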

The honest synthesis: feature-level steering is real and outperforms prompting, but it is better understood as biasing a sampling distribution than as flipping a deterministic switch. The methods that hold up monitor and re-steer continuously (persona vectors tracking drift) rather than betting on one decisive edit. That is closer in spirit to multi-turn reward shaping, which nudges behavior over an interaction, than to a single clean intervention (Why do language models respond passively instead of asking clarifying questions?).
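A monitor-and-re-steer loop might look like the sketch below. It assumes a unit-length trait direction is already known and that hidden states arrive as plain lists of floats; `monitor_and_resteer`, the threshold value, and the drifting states are hypothetical and stand in for whatever readout and correction a real system would use.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def monitor_and_resteer(states, direction, threshold):
    """Check each state's projection and remove only the excess over the threshold.

    Assumes `direction` has unit length, so subtracting `excess * direction`
    lowers the projection by exactly `excess`.
    """
    corrected = []
    interventions = 0
    for h in states:
        score = dot(h, direction)
        if score > threshold:
            excess = score - threshold
            h = [x - excess * d for x, d in zip(h, direction)]
            interventions += 1
        corrected.append(h)
    return corrected, interventions


# Toy drift: the trait coordinate grows step by step, the others stay fixed.
direction = [1.0, 0.0, 0.0]
states = [[0.2 * t, 1.0, -0.5] for t in range(10)]

corrected, interventions = monitor_and_resteer(states, direction, threshold=0.9)
```

The loop intervenes only on the steps where the readout crosses the threshold and leaves the other coordinates untouched, which is the sense in which continuous monitoring differs from a single up-front edit.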

Sources (9 notes)

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Can language models adapt communication style to different contexts?

System prompts and RLHF training lock models into one communicative identity across all interactions, preventing the contextual register-switching and value trade-offs that characterize human pragmatics. Users cannot reshape model behavior through dialogue negotiation.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating a response from the same context yields different outputs, each consistent with that context, which indicates the model has made no fixed commitment.

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
