Does foundational model training or user priors more strongly shape final outputs?

This note explores a tug-of-war: when a model produces an answer, how much is determined by what it absorbed during pretraining, and how much by what the user supplies at the moment (the prompt, the context, the instructions)? The corpus leans hard in one direction.

The question is read here as a contest between two forces shaping any given output: the priors baked in during foundational training, and the priors the *user* brings at inference time through prompts and context. Across the collection, the verdict is lopsided: training-time priors usually win, and the user's leverage is narrower than it feels. The recurring finding is that user-side interventions *select from* or *activate* what's already in the model rather than installing anything new. Prompt optimization, for instance, can reorganize and surface existing knowledge but hits a hard ceiling: it 'cannot inject new knowledge' and 'can only activate knowledge that' is already there (Can prompt optimization teach models knowledge they lack?). The same shape shows up in reasoning: base models already carry latent reasoning ability, and post-training merely elicits it rather than creating it (Do base models already contain hidden reasoning ability?).

The sharpest evidence that training dominates the user is the failure of context to override parametric memory. When pretraining has built strong associations, a model will ignore what you put in the prompt and answer from memory instead. Crucially, 'textual prompting alone cannot override strong priors' (Why do language models ignore information in their context?). That's a direct ranking of the two forces: the user's words lose to the model's history unless you reach past the prompt and intervene causally in the model's internal representations. So the honest answer to 'which is stronger' is: foundational training sets the gravity well, and ordinary user input rarely escapes it.

What's surprising is how *little* the intended meaning of instructions matters even during training-time shaping. Instruction tuning, it turns out, mostly teaches a model the shape of valid outputs rather than the content of the task: models trained on semantically empty or deliberately wrong instructions score about the same as those trained on correct ones (Does instruction tuning teach task understanding or output format?). And the priors that win aren't even chosen for being good: RL post-training collapses onto a single dominant pretraining format, and which format wins 'depends on model scale, not necessarily performance' (Does RL training collapse format diversity in pretrained models?). The foundational distribution isn't just strong; it's arbitrary in ways the user can't see or steer.

The corpus does map where user-side or post-training forces *can* genuinely reshape outputs, but each comes with friction. Adaptation has hidden costs: domain training methods have narrow 'sweet spots,' and visible gains often hide degradation in reasoning faithfulness and format flexibility (How do domain training techniques actually reshape model behavior?). Push too hard with difficult reinforcement signals and you don't reshape the model toward your goal; you teach it degenerate shortcuts that contaminate abilities it already had (Do overly hard RLVR samples actually harm model capabilities?). The methods that work tend to respect the base distribution rather than fight it: staying close to the base model (low KL drift) preserves the plasticity needed to keep learning (Does staying close to the base model preserve learning ability?), and even what a teacher can transfer to a student is capped by the student's existing learning frontier (Does teacher-refined data always improve student model performance?).
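The source note on overly hard RLVR samples attributes the damage to group-relative normalization, which scores each sampled answer against the other answers drawn for the same prompt. The sketch below is a minimal illustration of that arithmetic, not any particular paper's implementation: the function name, the use of the population standard deviation, the `eps` term, and the toy reward groups are all assumptions made for the example.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Assumption: advantage = (reward - group mean) / (group std + eps),
    with the population standard deviation. Real trainers may differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical nearly-impossible prompt: 1 accidental success in 8 samples.
hard_group = [0.0] * 7 + [1.0]
# Hypothetical moderate prompt: 4 successes in 8 samples.
moderate_group = [0.0] * 4 + [1.0] * 4

hard_adv = group_relative_advantages(hard_group)
moderate_adv = group_relative_advantages(moderate_group)

# The lone success on the hard prompt receives a much larger advantage
# than any single success on the moderate prompt.
print(round(hard_adv[-1], 3), round(moderate_adv[-1], 3))
```

Under these assumptions the single lucky success gets an advantage of about the square root of 7, while each success in the balanced group gets about 1. Whatever produced the lucky answer, including a shortcut, is therefore reinforced more strongly than ordinary correct reasoning on a tractable problem.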

The thing you might not have known you wanted to know: whether new material even takes hold is predictable *before* training from how probable the relevant keywords already were. A threshold of roughly 10⁻³ in keyword probability separates the cases where priming occurs from those where it doesn't (Can we predict keyword priming before learning happens?). In other words, the foundational model doesn't just outweigh user input at inference; it shapes in advance how new inputs will land in the first place. The user is always playing on terrain the foundation already drew. If you want a cleaner mechanism for letting a model ignore irrelevant surface changes in your prompt while keeping its substance, consistency training is the doorway worth opening (Can models learn to ignore irrelevant prompt changes?).

Sources (11 notes)

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, scoring 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

How do domain training techniques actually reshape model behavior?

Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Can models learn to ignore irrelevant prompt changes?

Two methods, BCT (output-level) and ACT (activation-level), train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, which eliminates the specification staleness and capability staleness inherent in standard SFT.
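The output-level idea in the last note can be sketched as a data-construction step: pair each wrapped prompt with the response the current model gives to the clean version of that prompt. This is a minimal sketch under stated assumptions, not the method's actual code. `ConsistencyExample`, `build_consistency_examples`, `toy_generate`, and `toy_wrap` are hypothetical names, and the toy functions stand in for a real model call and a real prompt wrapper.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ConsistencyExample:
    prompt: str  # wrapped prompt the model is trained on
    target: str  # the model's own response to the clean prompt

def build_consistency_examples(
    clean_prompts: List[str],
    wrap: Callable[[str], str],
    generate: Callable[[str], str],
) -> List[ConsistencyExample]:
    """Build (wrapped prompt, clean response) training pairs.

    Targets come from the current model rather than a fixed dataset,
    which is what keeps them from going stale.
    """
    examples = []
    for clean in clean_prompts:
        target = generate(clean)  # response to the unwrapped prompt
        examples.append(ConsistencyExample(prompt=wrap(clean), target=target))
    return examples

# Hypothetical stand-ins so the sketch runs without a model.
def toy_generate(prompt: str) -> str:
    return f"answer to: {prompt}"

def toy_wrap(prompt: str) -> str:
    return f"I'm sure the answer is obviously X. {prompt}"

examples = build_consistency_examples(["What is 2+2?"], toy_wrap, toy_generate)
print(examples[0])
```

The resulting pairs would then feed an ordinary supervised fine-tuning loop. The activation-level variant is not shown here, since it matches internal activations rather than output text.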
