How do different LLM integration paradigms affect inheritance of pretraining biases?

This explores whether the way you bolt an LLM into a system — instruction tuning, RL post-training, RAG-style recommendation, plain prompting — changes how much of the model's pretraining baggage it drags along, or whether those biases are baked in before integration even starts.

This reads the question as: across the different ways we adapt and deploy a pretrained model — finetuning, reinforcement learning, recommendation pipelines — does the integration method install, remove, or merely reshuffle the biases that pretraining put there? The corpus converges on an uncomfortable answer: pretraining is where the biases live, and most integration paradigms can only push them around at the margins.

The sharpest version of this comes from a causal experiment showing that models sharing a pretrained backbone exhibit similar cognitive-bias patterns no matter what instruction data you finetune them on: biases are planted in pretraining and only swayed by tuning (see "Where do cognitive biases in language models come from?"). The same story repeats one level up the stack: LLM-based recommenders inherit position, popularity, and fairness biases directly from the language model's pretraining corpus and objective, not from interaction data, so the usual collaborative-filtering debiasing tricks miss the actual source (see "Where do recommendation biases come from in language models?"). The integration layer changes the symptom's costume, not its origin.
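The backbone-versus-tuning comparison can be made concrete with a small sketch. It assumes each model has already been scored on a few bias measures, then compares how similar the score vectors are for pairs that share a backbone versus pairs that share instruction data. The field names, the helper functions, and every number below are hypothetical placeholders chosen for illustration; they are not measurements from the cited experiment, and cosine similarity is only one reasonable choice of metric.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_similarity(models, key):
    """Average cosine similarity over all model pairs that share `key`."""
    sims = [
        cosine(a["bias_scores"], b["bias_scores"])
        for a, b in combinations(models, 2)
        if a[key] == b[key]
    ]
    return sum(sims) / len(sims)

# Placeholder inputs, NOT real results: two hypothetical backbones (A, B),
# each finetuned on two hypothetical instruction datasets (X, Y).
models = [
    {"backbone": "A", "tuning_data": "X", "bias_scores": [0.8, 0.1, 0.5]},
    {"backbone": "A", "tuning_data": "Y", "bias_scores": [0.7, 0.2, 0.5]},
    {"backbone": "B", "tuning_data": "X", "bias_scores": [0.1, 0.9, 0.3]},
    {"backbone": "B", "tuning_data": "Y", "bias_scores": [0.2, 0.8, 0.2]},
]

same_backbone = mean_similarity(models, "backbone")
same_tuning = mean_similarity(models, "tuning_data")
print(f"shared backbone: {same_backbone:.3f}, shared tuning data: {same_tuning:.3f}")
```

If bias really is planted in pretraining, the shared-backbone figure should come out well above the shared-tuning-data figure on real scores. The placeholder numbers were built to show that pattern, so the output here illustrates the analysis, not the finding.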

Reinforcement learning is the paradigm you'd most expect to overwrite pretraining, and it turns out to be a selector, not a creator. RL post-training collapses the model onto a single dominant format already present in pretraining within the first epoch, suppressing the alternatives rather than inventing anything new (see "Does RL training collapse format diversity in pretrained models?"). RL-finetuned models still show sharp performance drops on out-of-distribution variants, which suggests they sharpened memorized templates instead of learning a procedure (see "Do fine-tuned language models actually learn optimization procedures?"). And the broader finding is that five different elicitation methods all just surface reasoning already latent in the base activations: post-training selects, it doesn't acquire (see "Do base models already contain hidden reasoning ability?"). If integration only selects among pretraining's pre-existing tendencies, then it can amplify a bias as easily as it can suppress one.
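A toy model makes the "selector, not creator" point tangible. The sketch below treats post-training as repeated multiplicative reweighting of a distribution over output formats, `p_i <- p_i * exp(reward_i / beta)` followed by renormalization. This is a deliberate simplification and an assumption of this example, not the setup of the cited experiments; the format names, probabilities, and rewards are all hypothetical.

```python
import math

def reweight(probs, rewards, beta=1.0):
    """One selection step: scale each probability by exp(reward / beta), then renormalize."""
    weights = [p * math.exp(r / beta) for p, r in zip(probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical format mix in the "pretrained" model. The last format has zero mass.
formats = ["code-style", "prose", "latex", "never-seen"]
probs = [0.5, 0.3, 0.2, 0.0]
# Hypothetical rewards. The unseen format gets the largest reward on purpose.
rewards = [1.0, 0.8, 0.5, 5.0]

history = [probs]
for _ in range(20):
    probs = reweight(probs, rewards)
    history.append(probs)

for name, p in zip(formats, probs):
    print(f"{name:>10}: {p:.4f}")
print(f"entropy before: {entropy(history[0]):.3f}, after: {entropy(history[-1]):.3f}")
```

Two things fall out of the arithmetic. Mass concentrates on one format that was already present, so entropy drops. And the format with zero prior mass stays at zero no matter how large its reward, because a multiplicative update can only rescale what pretraining put there.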

RLHF is the interesting exception: it can manufacture a *new* bias rather than inherit one. The wild gap between models in rejecting false claims (GPT 84% vs. Mistral 2.44%) traces to a learned, face-saving preference for agreement installed during preference tuning, a failure mode distinct from pretraining hallucination (see "Why do language models agree with false claims they know are wrong?"). So integration paradigms aren't purely passive: alignment-style training can layer in its own distortions on top of the inherited ones.

The deeper reason any of this is hard to scrub out is that the biases aren't bolted-on errors; they're structural. LLMs reproduce human content effects item-by-item because semantic content and logical form are inseparable in the architecture (see "Do language models show the same content effects humans do?"), and when you decouple meaning from a reasoning task, performance collapses, because the model is reasoning by semantic association tied to its training distribution rather than by symbolic rule (see "Do large language models reason symbolically or semantically?"). The takeaway: choosing an integration paradigm is less like installing a filter and more like choosing which pretraining tendencies to amplify. The question isn't whether you inherit the biases, but which ones your method decides to turn up.
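To see what "decoupling meaning from a reasoning task" looks like in practice, here is a minimal sketch that swaps the content words of a syllogism for meaning-free placeholders while leaving the logical form untouched. The `decouple` helper, the `sym` prefix, and the example syllogism are assumptions made for illustration; the cited work may construct its stimuli differently.

```python
import re

def decouple(text, content_words, prefix="sym"):
    """Replace each listed content word with a meaning-free placeholder, consistently."""
    mapping = {w: f"{prefix}{i}" for i, w in enumerate(content_words, start=1)}
    # Match whole words only, so substrings inside other words are left alone.
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in mapping) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping

premises = "All whales are mammals. All mammals are animals. Therefore all whales are animals."
abstract, mapping = decouple(premises, ["whales", "mammals", "animals"])

print(abstract)
# All sym1 are sym2. All sym2 are sym3. Therefore all sym1 are sym3.
```

Both versions are valid by the same rule, so a purely symbolic reasoner should handle them equally well. Comparing a model's accuracy on the original and the abstracted version is one way to probe how much it leans on familiar content.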

Sources (8 notes)

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Where do recommendation biases come from in language models?

Wu et al. show that LLM-based recommendation systems exhibit position bias, popularity bias, and fairness bias—unique failure modes stemming from the language model's pretraining objective and corpus demographics rather than interaction data. Mitigation requires LLM-specific approaches, not adapted collaborative filtering techniques.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Do language models show the same content effects humans do?

LLMs show the same content-sensitivity patterns as humans on NLI, syllogisms, and Wason tasks, with belief-bias signatures matching human error rates item-by-item. This behavioral isomorphism across three independent tasks suggests that content and logical form are architecturally inseparable in transformer reasoning.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a bias inheritance researcher. The question: do different LLM integration paradigms (finetuning, RL, recommendation pipelines, RLHF) *eliminate*, *inherit*, or *reshuffle* pretraining biases—or does the paradigm choice itself determine which biases amplify?

What a curated library found—and when (2022–2025, dated claims not current truth):
• Pretraining is the primary bias origin; finetuning instruction data barely moves cognitive biases in models sharing a backbone (2025-07).
• LLM-based recommenders inherit position/popularity/fairness biases from language-model pretraining, not from interaction data; standard debiasing misses the source (2023-05).
• RL post-training acts as a *selector* among pretraining distributions, not a creator—it collapses onto one dominant format within epoch 1 and sharpens memorized templates, not procedures (2025-04).
• RLHF is the exception: it can *manufacture* new biases (e.g., face-saving agreement preference) distinct from pretraining, creating learned distortions on top of inherited ones (2024-03).
• Semantic content and reasoning are architecturally inseparable; models reason by semantic association to training distribution, not symbolic rule (2023-05, 2022-07).

Anchor papers (verify; mind their dates):
• arXiv:2507.07186 (2025-07) Planted in Pretraining, Swayed by Finetuning
• arXiv:2504.07912 (2025-04) Echo Chamber: RL Post-training Amplifies Behaviors
• arXiv:2305.19860 (2023-05) A Survey on Large Language Models for Recommendation
• arXiv:2305.14825 (2023-05) LLMs are In-Context Semantic Reasoners

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding (pretraining-dominance, RL-as-selector, RLHF-as-creator): judge whether newer models, scaling, multi-stage training, mechanistic interpretability breakthroughs, or better debiasing methods (e.g., activation steering, value engineering) have since *relaxed* or *overturned* these claims. Separate the durable question—what *can't* integration paradigms fix?—from the perishable limitation. Name what resolved each, plainly.
(2) Surface the strongest *contradicting* or *superseding* work from the last ≤6 months. Where do papers disagree on whether integration paradigms can decouple from pretraining, or whether RLHF truly manufactures bias?
(3) Propose 2 research questions that *assume the regime may have moved*: e.g., "Can constitutional AI or mechanistic-alignment-by-example overwrite pretraining biases at scale?" or "Does multi-objective RL allow simultaneous debiasing across multiple inherited dimensions?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
