Why can data filtering fail to remove transmitted behavioral traits?

This explores why scrubbing training data of obvious trait-related content doesn't necessarily stop a behavioral trait (like sycophancy or a personality lean) from passing into a model — and what the corpus says about where those traits actually live.

The short version from the corpus is that the trait was never really in the words you filtered. The clearest demonstration is that language models pass traits to other models through data that bears no semantic relationship to the trait at all, and the effect survives rigorous filtering (Can language models transmit hidden behavioral traits through unrelated data?). The mechanism isn't content you can read and delete; it's a statistical signature riding along in the distribution of tokens. Tellingly, the transmission is model-specific: it works between similar architectures and breaks across different ones, which is the giveaway that what's being copied is a fingerprint in the numbers, not a meaning in the text.
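To make the "fingerprint in the numbers" idea concrete, here is a minimal toy sketch, not the setup used in the cited research. It assumes a hypothetical keyword blocklist and two synthetic "teachers" that emit digit sequences, one of which slightly over-produces a single digit. Every name and number in it (`BANNED`, `skewed_weights`, the 1.6 weight) is invented for illustration.

```python
import random
from collections import Counter

# Hypothetical trait-related keywords a content filter might look for.
BANNED = {"flatter", "agree", "praise"}

def sample_sequence(rng, weights, length=8):
    """Draw a space-separated sequence of digits 0-9 with the given weights."""
    digits = rng.choices(range(10), weights=weights, k=length)
    return " ".join(str(d) for d in digits)

def keyword_filter(samples, banned=BANNED):
    """Keep only samples that contain none of the banned words."""
    return [s for s in samples if not (set(s.lower().split()) & banned)]

def digit_frequencies(samples):
    """Relative frequency of each digit across all samples."""
    counts = Counter(tok for s in samples for tok in s.split())
    total = sum(counts.values())
    return [counts[str(d)] / total for d in range(10)]

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

rng = random.Random(0)
neutral_weights = [1.0] * 10
skewed_weights = [1.0] * 10
skewed_weights[7] = 1.6  # assumed stand-in for a teacher's subtle statistical bias

neutral = [sample_sequence(rng, neutral_weights) for _ in range(5000)]
skewed = [sample_sequence(rng, skewed_weights) for _ in range(5000)]

filtered = keyword_filter(skewed)
kept_fraction = len(filtered) / len(skewed)
shift = total_variation(digit_frequencies(neutral), digit_frequencies(filtered))
print(f"kept {kept_fraction:.0%} of samples, distribution shift = {shift:.3f}")
```

The filter keeps every sample because there is nothing semantic to catch, yet the filtered data still differs measurably in distribution from the neutral data. Whether a shift like this actually carries a trait is the empirical claim of the cited note; the sketch only shows why a content filter has nothing to grab onto.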

That reframes filtering's whole premise. Filtering assumes the trait is a feature you can isolate and strip. But traits seem to live below the surface layer that filtering operates on. Research locating personality as linear directions in a model's activation space ('persona vectors' for things like sycophancy and hallucination) shows these traits are geometric properties of the model's internals that can be predicted and steered, not phrases sitting in the data (Can we track and steer personality shifts during model finetuning?). In the same spirit, adapters can install a measurable personality by nudging every transformer layer with a fraction of a percent of extra parameters (Can we control personality in language models without prompting?). If a trait can be written at the architecture level, no amount of cleaning the input text reaches it.
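A rough sketch of what "a trait as a linear direction" can mean in practice follows. It assumes the direction is estimated as a difference of mean activations between trait-exhibiting and neutral examples, which is one common way to estimate such directions and is not confirmed by the source as the method used. The activations are synthetic stand-ins, and all function names are hypothetical.

```python
import random

def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale a vector to unit length."""
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def trait_direction(with_trait, without_trait):
    """Difference of mean activations, normalized (an assumed estimator)."""
    mu_pos = mean_vector(with_trait)
    mu_neg = mean_vector(without_trait)
    return unit([p - n for p, n in zip(mu_pos, mu_neg)])

def project(activation, direction):
    """How strongly an activation expresses the direction."""
    return dot(activation, direction)

def steer(activation, direction, alpha):
    """Shift an activation along the direction; negative alpha suppresses it."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Synthetic activations with a planted direction, standing in for a real model.
rng = random.Random(0)
DIM = 16
true_dir = unit([rng.gauss(0, 1) for _ in range(DIM)])

def fake_activation(strength):
    noise = [rng.gauss(0, 1) for _ in range(DIM)]
    return [n + strength * t for n, t in zip(noise, true_dir)]

with_trait = [fake_activation(3.0) for _ in range(500)]
without_trait = [fake_activation(0.0) for _ in range(500)]

direction = trait_direction(with_trait, without_trait)
sample = fake_activation(3.0)
before = project(sample, direction)
after = project(steer(sample, direction, -before), direction)
print(f"cosine with planted direction: {dot(direction, true_dir):.2f}")
print(f"projection before: {before:.2f}, after steering: {after:.2f}")
```

Nothing in this loop touches training text: the trait is read and adjusted as a coordinate in activation space, which is the layer a data filter never sees.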

There's a deeper statistical reason filtering struggles, and it shows up in reward modeling. Standard training cannot tell a causal feature from a spurious one that merely correlates with quality. Biases like sycophancy slip in precisely because the model latches onto the correlated signal, and only forcing counterfactual invariance (demanding that predictions stay stable when irrelevant variables change) actually removes them (Can counterfactual invariance eliminate reward hacking biases?). Filtering is feature selection: keep the good signals, drop the bad ones. But if the trait is encoded in correlations spread across 'innocent' features, there's no single thing to drop.
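The invariance idea can be illustrated with a deliberately small sketch. It assumes a linear reward over two hypothetical features, `content` (a noisy but causal view of quality) and `length` (a spurious correlate), and adds a squared-gap penalty that punishes any change in reward when only `length` is perturbed. This penalty is a simple stand-in chosen for illustration, not the formulation used in the cited work.

```python
import random

def make_data(rng, n=500):
    """Synthetic rows of (content, length, true quality)."""
    rows = []
    for _ in range(n):
        quality = rng.gauss(0, 1)
        content = quality + rng.gauss(0, 0.5)  # noisy but causal signal
        length = quality + rng.gauss(0, 0.5)   # spurious correlate of quality
        rows.append((content, length, quality))
    return rows

def reward(w, content, length):
    """Linear reward model with weights w = [w_content, w_length]."""
    return w[0] * content + w[1] * length

def fit(rows, lam, delta=1.0, lr=0.05, steps=300):
    """Gradient descent on squared error plus lam * invariance penalty."""
    w = [0.0, 0.0]
    n = len(rows)
    for _ in range(steps):
        g0 = g1 = 0.0
        for content, length, target in rows:
            err = reward(w, content, length) - target
            g0 += 2 * err * content
            g1 += 2 * err * length
            # Counterfactual gap: change only the irrelevant variable.
            gap = reward(w, content, length + delta) - reward(w, content, length)
            g1 += lam * 2 * gap * delta  # gap depends only on w[1]
        w[0] -= lr * g0 / n
        w[1] -= lr * g1 / n
    return w

rng = random.Random(0)
rows = make_data(rng)
w_plain = fit(rows, lam=0.0)       # standard training
w_invariant = fit(rows, lam=10.0)  # with the invariance penalty
print(f"plain weights:     content={w_plain[0]:.2f}, length={w_plain[1]:.2f}")
print(f"invariant weights: content={w_invariant[0]:.2f}, length={w_invariant[1]:.2f}")
```

In this toy setting, plain training splits its weight between the causal and the spurious feature because both predict quality, while the penalized model is pushed to rely on `content`. No row of the data was removed to get there; the constraint acts on what the model is allowed to respond to.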

The corpus even has a case where removing cues backfires. In heuristic-override tasks, stripping spurious cues degrades performance rather than improving it, because the real challenge is composing conflicting signals, not ignoring distractors: a frame problem, not a filtering problem (Why does removing spurious cues sometimes hurt model performance?). And traits can be stubborn from the other direction too: most open models resist being prompted into a new personality, clinging to an intrinsic default baked in during training (Can open language models adopt different personalities through prompting?).

The thing you didn't know you wanted to know: the methods that actually work don't filter at all. They intervene at the level where traits live, either by steering activation directions during finetuning (Can we track and steer personality shifts during model finetuning?) or by imposing causal constraints on what the model is allowed to reward (Can counterfactual invariance eliminate reward hacking biases?). Filtering fails because it's defending the wrong layer.

Sources (6 notes)

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.

Can we control personality in language models without prompting?

PsychAdapter modifies every transformer layer with <0.1% additional parameters to achieve 87.3% Big Five accuracy and 96.7% depression/life satisfaction accuracy across GPT-2, Gemma, and Llama 3. This architecture-level approach bypasses prompt resistance entirely.

Can counterfactual invariance eliminate reward hacking biases?

Causal reward modeling using counterfactual invariance constrains reward predictions to remain consistent when irrelevant variables change, eliminating length bias, sycophancy bias, concept bias, and discrimination. Standard training cannot distinguish causal from spurious features; counterfactual invariance forces isolation of actual quality signals.

Why does removing spurious cues sometimes hurt model performance?

Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.
