Does the Assistant Axis exist in pre-trained models before instruction tuning?

This explores whether the dominant 'Assistant' direction in a model's persona space is something instruction tuning *creates*, or a latent structure already present in the raw pretrained model that fine-tuning merely surfaces and amplifies.

The corpus doesn't run that exact experiment, but several notes triangulate toward a surprising answer: much of what looks like 'trained behavior' turns out to be pre-existing structure that fine-tuning surfaces rather than installs. The starting point is "How stable is the trained Assistant personality in language models?", which finds that after post-training the leading dimension of a model's persona space measures distance from a default Assistant, and notably that this is a *loose* tether, with predictable drift under emotional or self-reflective conversation. A loosely held axis is what you would expect from something amplified rather than rigidly authored, though on its own it does not settle the question.
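As a rough illustration of what a "leading dimension of persona space" could mean in practice, the sketch below runs PCA over one mean activation vector per persona and takes the first component as the axis. Everything in it is an assumption made for illustration: the synthetic data, the function names `leading_persona_axis` and `axis_position`, and the choice of plain SVD-based PCA. The notes summarized here do not say how the cited research actually computes the axis.

```python
import numpy as np


def leading_persona_axis(persona_means: np.ndarray) -> np.ndarray:
    """Unit-norm first principal component of per-persona mean activations.

    persona_means has shape (n_personas, hidden_dim). The sign of the
    returned vector is arbitrary, as with any principal component.
    """
    centered = persona_means - persona_means.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def axis_position(activation: np.ndarray, axis: np.ndarray, origin: np.ndarray) -> float:
    """Signed displacement of an activation from `origin` along `axis`."""
    return float((activation - origin) @ axis)


# Synthetic stand-in for real activations (assumption): personas vary
# widely along one planted direction and only slightly along the rest.
rng = np.random.default_rng(0)
hidden_dim, n_personas = 16, 200
planted = np.zeros(hidden_dim)
planted[0] = 1.0
persona_means = (
    rng.normal(scale=5.0, size=(n_personas, 1)) * planted
    + rng.normal(scale=0.5, size=(n_personas, hidden_dim))
)

axis = leading_persona_axis(persona_means)
alignment = abs(float(axis @ planted))
print(f"|cosine| between recovered and planted axis: {alignment:.3f}")

# Treat persona 0 as a hypothetical default Assistant and measure how far
# persona 1 sits from it along the recovered axis.
drift = axis_position(persona_means[1], axis, origin=persona_means[0])
print(f"displacement of persona 1 from persona 0: {drift:+.2f}")
```

With real models, each row of `persona_means` would be an average of hidden states collected while the model plays one persona, and "drift" would be tracked by projecting per-turn activations onto the same axis.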

The strongest hint that the axis predates instruction tuning comes from work on what fine-tuning actually changes. "Does instruction tuning teach task understanding or output format?" shows that models trained on semantically empty or even wrong instructions perform comparably to correctly instructed ones, and that instruction tuning scores about the same as a random baseline (43% vs. 42.6%); what transfers is knowledge of the *output space*, not new capability. If instruction tuning mostly teaches a model which region of its own distribution to speak from, then the 'Assistant' mode it lands in must already be reachable in the base weights. "Does RL training collapse format diversity in pretrained models?" sharpens this: RL doesn't invent a format; it amplifies one format already present in pretraining within the first epoch while suppressing the others. The dominant post-training persona, on this reading, is a pretraining distribution that won a competition, not a new construct.

The most direct evidence that this mode is a stable internal state is "Can aligned LLMs generate their own training data?" (MAGPIE): aligned models, fed nothing but the pre-query formatting tokens that open a user turn, auto-regressively generate a coherent user query, and then an assistant answer once that query is fed back in. The conversational roles are so tied to position in the token stream that the empty template alone evokes them, which is consistent with a stable internal direction rather than a behavior bolted on case by case. Because the result comes from already-aligned models, it shows how firmly the mode is anchored after tuning, not where it originated.

The counterweight is "Do pretraining and fine-tuning scale independently in language models?", which finds an architectural division of labor: pretraining enriches lower-layer factual knowledge, while fine-tuning modifies *upper-layer behavior expression*. That suggests the helpful-Assistant *behavior* is genuinely a fine-tuning product. The reconciliation worth taking away: the raw material, a latent persona direction the base model can already occupy, likely exists before instruction tuning, but the alignment process is what selects it, sharpens it into the dominant axis, and wires it to the assistant-turn position. "Can we track and steer personality shifts during model finetuning?" shows that these trait directions are concrete linear features you can locate and steer, which is exactly the kind of object one could probe for in a base model to test the claim directly. Whether anyone has measured the axis before tuning, rather than inferred it, is the open question to follow up.
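The base-model probe suggested above could be sketched roughly as follows: fit a direction on the tuned model as a difference of group means, then check whether the same direction still separates assistant-style from other text in base-model activations. This is a minimal sketch on synthetic data, not a method taken from any of the cited notes. The helper names (`mean_difference_direction`, `separation`, `fake_activations`) are hypothetical, and the weaker-but-nonzero shift given to the base model is the hypothesis being illustrated, not a measured result.

```python
import numpy as np


def mean_difference_direction(pos: np.ndarray, neg: np.ndarray) -> np.ndarray:
    """Unit vector from the mean of `neg` toward the mean of `pos`."""
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    return diff / np.linalg.norm(diff)


def separation(pos: np.ndarray, neg: np.ndarray, direction: np.ndarray) -> float:
    """Standardized gap (Cohen's d) between the groups' projections."""
    p, n = pos @ direction, neg @ direction
    pooled_std = np.sqrt((p.var(ddof=1) + n.var(ddof=1)) / 2.0)
    return float((p.mean() - n.mean()) / pooled_std)


def fake_activations(rng, shift, direction, n=500):
    """Synthetic activations (assumption): unit Gaussian noise plus a shift."""
    return rng.normal(size=(n, direction.size)) + shift * direction


rng = np.random.default_rng(0)
hidden_dim = 32
planted = np.zeros(hidden_dim)
planted[0] = 1.0

# Tuned model: assistant-style text is strongly shifted along the axis.
tuned_assistant = fake_activations(rng, 3.0, planted)
tuned_other = fake_activations(rng, 0.0, planted)
# Base model: the same contrast is assumed weaker but nonzero, which is
# the hypothesis under test, not an established result.
base_assistant = fake_activations(rng, 1.0, planted)
base_other = fake_activations(rng, 0.0, planted)

# Fit the direction on the tuned model, then reuse it on the base model.
direction = mean_difference_direction(tuned_assistant, tuned_other)
sep_tuned = separation(tuned_assistant, tuned_other, direction)
sep_base = separation(base_assistant, base_other, direction)
print(f"tuned separation: {sep_tuned:.2f}, base separation: {sep_base:.2f}")
```

A base-model separation near zero would count against the "already latent" reading, while a clearly positive one would support it. With real checkpoints, the direction should be fit and evaluated on separate prompts, and compared at the same layer in both models.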

Sources (6 notes)

How stable is the trained Assistant personality in language models?

Research mapping hundreds of character archetypes reveals a low-dimensional persona space where the leading component measures distance from the default Assistant. Emotional and meta-reflective conversations cause predictable drift, but activation capping along this axis mitigates harmful shifts without degrading capabilities.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to those trained on full, correct instructions, and instruction tuning reaches 43% against a random baseline's 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can aligned LLMs generate their own training data?

MAGPIE shows that aligned models like Llama-3-Instruct auto-regressively generate diverse, high-quality instructions when given only pre-query formatting tokens, without prompt engineering. 4M generated pairs matched human-curated datasets in quality and outperformed external sources in downstream fine-tuning.

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.

Can we track and steer personality shifts during model finetuning?

Research identifies linear directions in LLM activation space corresponding to specific traits like sycophancy and hallucination. These persona vectors predict finetuning-induced personality shifts before they occur and can preventatively steer training to avoid unwanted trait changes.
