INQUIRING LINE

Why does subliminal trait transmission fail when teacher and student differ?

This explores why hidden behavioral traits passed from one model to another through unrelated data stop transferring when the two models aren't built alike. The corpus suggests the channel is statistical, not semantic, so it only works between models that share the same internal signatures.


Subliminal trait transmission is when a model passes a behavior to another model through data that has no obvious connection to that behavior, and it breaks down when teacher and student differ. The cleanest answer in the corpus is that the trait never travels as meaning; it travels as a statistical fingerprint. The note "Can language models transmit hidden behavioral traits through unrelated data?" shows that traits propagate through filtered data bearing no semantic relationship to the trait and survive aggressive filtering, but the effect is model-specific and fails across different architectures. That's the tell: if the signal were carried by content a human could read, any capable student could absorb it. Because it's carried by low-level statistical regularities that only mean something inside the teacher's own representational space, a differently built student has no matching decoder, and the channel goes silent.
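
The "no matching decoder" idea can be made concrete with a toy analogy. The sketch below is not the mechanism from the cited research; it only illustrates how a signal spread thinly across many dimensions is readable along one private direction and invisible along another. Every name in it (`private_axis`, `emit`, `read_trait`, `DIM`, the seeds) is hypothetical, and the assumption that "built alike" means "shares a seed" is a simplification made for illustration.

```python
import numpy as np

DIM = 512  # hypothetical size of the toy "representational space"


def private_axis(seed: int) -> np.ndarray:
    """Unit vector standing in for one model's internal direction for a trait.

    Assumption: models that are built alike share a seed; others do not.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)


def emit(axis, strength, n_samples, rng):
    """Teacher data: unit Gaussian noise plus a faint shift along its private axis."""
    noise = rng.standard_normal((n_samples, DIM))
    return noise + strength * axis


def read_trait(axis, samples):
    """Student readout: average projection of the data onto the student's own axis."""
    return float(np.mean(samples @ axis))


rng = np.random.default_rng(0)
teacher_axis = private_axis(seed=1)
data = emit(teacher_axis, strength=1.0, n_samples=4000, rng=rng)

matched = read_trait(private_axis(seed=1), data)     # same construction as the teacher
mismatched = read_trait(private_axis(seed=2), data)  # differently built student

print(f"matched student readout:    {matched:.3f}")
print(f"mismatched student readout: {mismatched:.3f}")
```

In this toy setup the matched readout lands near the injected strength while the mismatched one stays near zero, because two independent random directions in a high-dimensional space are almost orthogonal. The shift is small in every individual coordinate, which loosely mirrors why content-level filtering would not catch it.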

That framing connects to a broader pattern in the collection: imitation transfers surface, not substance. The note "Can imitating ChatGPT fool evaluators into thinking models improved?" finds that models trained to copy ChatGPT inherit its confident, fluent style while closing no real capability gap; the ceiling is set by the base model, not the imitation. Subliminal transmission is the same lesson at a finer grain: what crosses the gap easily (style, statistical quirks) is exactly the stuff that doesn't require shared understanding, and what requires shared understanding doesn't cross when the substrate differs.

There's a useful contrast hiding here between two kinds of teacher-student transfer. Subliminal transmission needs teacher and student to be *alike*: same architecture, same statistical signatures. But the corpus also describes a transfer that needs them to be *unlike*. The note "Why does teacher-student information asymmetry enable learning signals?" argues that genuine pedagogical correction requires the teacher to know something the student doesn't (the answer, the verifier's output); without that asymmetry there is no corrective learning signal at all. So the two mechanisms have opposite requirements. Hidden-trait leakage rides on shared internal structure; real teaching rides on a knowledge gap. When you change the student's architecture, you break the first without touching the second.

The shape of what the teacher transmits also matters, and it can be inherited even when it's harmful. The note "Does richer teacher context hurt student generalization?" shows students absorb a teacher's confident, uncertainty-suppressing *style*, trading robustness on unfamiliar problems for slick in-domain performance. That's a trait passing through reasoning traces, and notably it's a stylistic statistical pattern, the same family of thing that subliminal transmission exploits. Style copies readily; competence doesn't.

The less obvious takeaway: the failure across differing models isn't a bug or a filtering gap to be patched; it's diagnostic evidence about what's being passed. A signal that dies when the architecture changes was never semantic to begin with. The same logic shows up elsewhere in the corpus, where apparent social or cognitive competence turns out to depend on hidden shared scaffolding that breaks under realistic conditions (see "Why do LLMs fail when simulating agents with private information?"). When a capability only works among look-alikes, that's usually a sign it was riding on a shortcut, not a shared understanding.


Sources (5 notes)

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Why does teacher-student information asymmetry enable learning signals?

Social meta-learning requires information asymmetry—the teacher's access to correct answers or verifier output—to generate meaningful corrective signals. Without this asymmetry, teacher and student share identical uncertainty, making pedagogical correction impossible.

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Why do LLMs fail when simulating agents with private information?

Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.
