INQUIRING LINE

Does alignment training create bidirectional instruction and response mappings?

This explores whether alignment training does more than teach instruction→response — whether it also wires the reverse, so a model can run response→instruction, treating the two as two ends of one learned mapping.


This explores whether alignment training does more than teach instruction→response — whether it also wires the reverse, so a model can run response→instruction. The most striking piece of evidence in the corpus is MAGPIE Can aligned LLMs generate their own training data?: when an aligned model like Llama-3-Instruct is handed only the pre-query formatting tokens — the empty slot where a user instruction would go — it auto-regressively generates fluent, diverse instructions out of nothing. That is the reverse direction made visible. The model isn't just answering prompts; the alignment process has carved the prompt-slot into its distribution so deeply that it can run the chat template backward and reconstruct plausible instructions. In that sense, yes: alignment leaves behind a mapping that's traversable in both directions.

But the more interesting answer is what kind of mapping it is. Does instruction tuning teach task understanding or output format? finds that models trained on semantically empty or even deliberately wrong instructions perform almost as well as those trained on correct ones (43% vs 42.6%). What transfers isn't comprehension of the instruction — it's knowledge of the output space. If alignment is mostly teaching the shape of the conversational format rather than the semantics of any particular task, then the 'bidirectionality' MAGPIE exploits is less a deep instruction↔response logic and more a learned joint distribution over the chat template itself. The model knows what well-formed turns look like from both ends because it learned the format, not the meaning.

That reframing connects to a quieter finding about format collapse. Does RL training collapse format diversity in pretrained models? shows RL post-training amplifies one format distribution within the first epoch and suppresses the alternatives — and the winner depends on model scale, not performance. So the joint mapping alignment installs isn't neutral; it's narrowed toward a single dominant template. This is why MAGPIE's reverse-generated instructions come out so coherent: there's essentially one well-worn groove to walk backward through. The same narrowing surfaces at the population level in Do different AI models actually produce diverse outputs?, where 70+ models independently produce near-identical responses — an 'Artificial Hivemind' driven by overlapping alignment procedures. The mapping isn't just bidirectional; across the whole field it's converging on the same map.

The thing worth carrying away: if alignment encodes a reversible format-level mapping rather than genuine task understanding, then a model's apparent competence and its self-generated training data both inherit the same blind spots. MAGPIE's synthetic instructions can match human-curated datasets Can aligned LLMs generate their own training data? precisely because they're sampled from the same groove the model already lives in — which is powerful for cheap data generation but also a closed loop. And the loop has costs the format doesn't show: Does preference optimization harm conversational understanding? documents how preference optimization rewards confident single-turn answers and cuts grounding acts like clarifying questions by 77.5%, while Does alignment training suppress socially necessary speech acts? shows the objective structurally suppresses warning and alarm. The bidirectional mapping alignment builds is real and usable in reverse — but it maps a deliberately flattened slice of conversation, and running it either direction stays inside that slice.


Sources 6 notes

Can aligned LLMs generate their own training data?

MAGPIE shows that aligned models like Llama-3-Instruct auto-regressively generate diverse, high-quality instructions when given only pre-query formatting tokens, without prompt engineering. 4M generated pairs matched human-curated datasets in quality and outperformed external sources in downstream fine-tuning.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve comparable performance to those trained on full correct instructions, achieving 43% vs random baseline 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts by 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Does alignment training suppress socially necessary speech acts?

RLHF optimization rewards calibrated neutrality and hedged claims, which structurally prevents models from performing speech acts requiring overclaiming relative to baseline—like alarm, warning, prophecy, and denunciation. This is a direct consequence of the alignment objective, not a fixable bug.

Next inquiring lines