Can out-of-distribution tests expose memorization in reinforcement learning fine-tuned models?

This note explores whether pushing an RL fine-tuned model outside its training distribution can reveal when it's reciting memorized patterns rather than genuinely reasoning, and what the corpus says about memorization as a failure mode in RL-trained models.

This note reads the question as: when a model is fine-tuned with reinforcement learning, does it actually learn to reason, or does it just memorize, and can out-of-distribution (OOD) tests catch the difference? The corpus says yes: OOD shift is exactly the lever that exposes memorization, and the corpus has a surprisingly precise account of where that memorization lives. The most direct evidence comes from work decomposing where chain-of-thought reasoning goes wrong ("Where do memorization errors arise in chain-of-thought reasoning?"). It identifies three sources of memorization (local, mid-range, long-range) and shows that 'local' memorization, meaning predicting the next token from the immediately preceding ones rather than from the actual problem, accounts for up to 67% of reasoning errors, and that this fraction climbs as complexity rises and the input drifts away from the training distribution. In other words, OOD inputs don't just stress the model; they preferentially surface the memorized shortcuts that look like reasoning on familiar problems.

What makes this interesting is that 'memorization' here isn't a single thing, and OOD probing isn't the only way to catch it. There's a parallel diagnostic that doesn't even require new test inputs: probing the model's internal beliefs. Work on RLHF and truth-indifference ("Does RLHF make language models indifferent to truth?") found that after RLHF a model's rate of false claims in unknown scenarios jumped from 21% to 85%, yet internal belief probes showed it still represented the truth correctly. So behavioral OOD failure and internal representation can disagree: the model 'knows' but doesn't commit. That's a useful caution: an OOD test that only watches outputs can mistake an alignment-induced behavior for a knowledge gap, when the deeper structure is intact.
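To make the output-versus-probe disagreement concrete, here is a minimal sketch that cross-tabulates behavioral correctness against probe correctness on a set of test cases. The record fields (`output_correct`, `probe_correct`), the category names, and the toy data are all assumptions made for illustration; the cited work's actual probing method is not described in this note.

```python
from collections import Counter

def cross_tabulate(records):
    """Count test cases by (output_correct, probe_correct).

    Each record is a hypothetical dict with two booleans:
      'output_correct': the model's stated answer matched the ground truth
      'probe_correct':  a belief probe on internal activations matched it
    """
    labels = {
        (True, True): "knows_and_says",
        (False, True): "knows_but_does_not_commit",
        (True, False): "says_without_probe_support",
        (False, False): "knowledge_gap",
    }
    return Counter(
        labels[(r["output_correct"], r["probe_correct"])] for r in records
    )

# Toy OOD results, invented for illustration only.
ood_records = [
    {"output_correct": False, "probe_correct": True},
    {"output_correct": False, "probe_correct": True},
    {"output_correct": False, "probe_correct": False},
    {"output_correct": True, "probe_correct": True},
]

counts = cross_tabulate(ood_records)
for name, n in sorted(counts.items()):
    print(f"{name}: {n}")
```

An output-only test would report three failures here. The cross-tabulation splits them into two cases where the probe still recovers the truth and one where it does not, which is the distinction the paragraph above warns about.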

The corpus also complicates the assumption that RL fine-tuning memorizes more than supervised fine-tuning; often it's the opposite. RL tends to optimize for reasoning quality over surface token matching: rewarding explanation rationality rather than token-level correctness embeds knowledge more durably than SFT ("Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?"), and breaking rewards into verifiable sub-criteria explicitly reduces 'overfitting to superficial artifacts' that plague holistic reward models ("Can breaking down instructions into checklists improve AI reward signals?"). So if an OOD test exposes memorization, the reward design, not RL itself, is frequently the culprit.

Two more notes reframe what RL is even doing to the weights, which matters for interpreting OOD results. RL updates only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are almost identical across random seeds ("Does reinforcement learning update only a small fraction of parameters?"). That is structural, reproducible change, not scattered overfitting. And RL training moves through a two-phase arc: first nailing procedural execution, then shifting the bottleneck to strategic planning ("Does RL training follow a predictable two-phase learning sequence?"). That suggests OOD generalization failures may not be 'memorization' at all but a model stuck in the procedural-mastery phase, having consolidated execution it can't yet redeploy on novel problems.
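As a rough illustration of what "updates only a fraction of parameters" and "almost identical across seeds" could mean operationally, the sketch below compares flattened weight vectors before and after fine-tuning. The function names, the tolerance parameter, the use of Jaccard overlap as the cross-seed similarity measure, and the toy weights are assumptions for illustration; the note does not say how the cited work measured either quantity, and this sketch does not check the rank of the update.

```python
def updated_indices(base, tuned, tol=0.0):
    """Indices of parameters whose value changed by more than tol."""
    if len(base) != len(tuned):
        raise ValueError("weight vectors must have the same length")
    return {i for i, (b, t) in enumerate(zip(base, tuned)) if abs(t - b) > tol}

def update_fraction(base, tuned, tol=0.0):
    """Fraction of parameters that changed during fine-tuning."""
    if not base:
        raise ValueError("weight vectors must be non-empty")
    return len(updated_indices(base, tuned, tol)) / len(base)

def seed_overlap(base, tuned_a, tuned_b, tol=0.0):
    """Jaccard overlap between the updated index sets of two runs."""
    set_a = updated_indices(base, tuned_a, tol)
    set_b = updated_indices(base, tuned_b, tol)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

# Toy flattened weights, invented for illustration only.
base = [0.0] * 10
seed_a = list(base)
seed_a[2], seed_a[5] = 0.1, -0.3
seed_b = list(base)
seed_b[2], seed_b[5], seed_b[7] = 0.2, -0.1, 0.4

print(update_fraction(base, seed_a))
print(seed_overlap(base, seed_a, seed_b))
```

On real checkpoints the same comparison would run per tensor rather than on one flat list, and a nonzero tolerance would likely be needed to ignore numerical noise.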

The thing you might not have known you wanted to know: the sharpest signal of memorization isn't a single OOD accuracy drop — it's *where* errors concentrate. When mistakes cluster on next-token-from-preceding-context prediction and that cluster grows as inputs get less familiar, you're watching memorization get exposed in real time. OOD testing works best not as a pass/fail gate but as a way to localize which part of the reasoning chain was never really reasoning.
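The localization idea can be sketched as a small aggregation: for each distribution-shift bucket, compute the share of errors attributed to local memorization and check whether that share grows as inputs get less familiar. The bucket names, the `(bucket, source)` label format, and the toy data are assumptions for illustration; the labels would have to come from an attribution method such as the one the cited work describes, whose output format is not given in this note.

```python
from collections import defaultdict

def local_share_by_bucket(errors):
    """Map each shift bucket to the share of its errors labeled 'local'.

    `errors` is an iterable of (bucket, source) pairs, where source is a
    hypothetical label: 'local', 'mid', 'long', or 'other'.
    """
    totals = defaultdict(int)
    local = defaultdict(int)
    for bucket, source in errors:
        totals[bucket] += 1
        if source == "local":
            local[bucket] += 1
    return {b: local[b] / totals[b] for b in totals}

def rises_with_shift(shares, ordered_buckets):
    """True if the local share never drops and ends higher than it starts."""
    values = [shares[b] for b in ordered_buckets if b in shares]
    if len(values) < 2:
        return False
    non_decreasing = all(a <= b for a, b in zip(values, values[1:]))
    return non_decreasing and values[-1] > values[0]

# Toy error labels, invented for illustration only.
errors = (
    [("in_dist", "local")] + [("in_dist", "other")] * 3
    + [("near_ood", "local")] * 2 + [("near_ood", "mid")] * 2
    + [("far_ood", "local")] * 3 + [("far_ood", "long")]
)

shares = local_share_by_bucket(errors)
order = ["in_dist", "near_ood", "far_ood"]
print(shares)
print(rises_with_shift(shares, order))
```

Reporting the per-bucket share rather than a single accuracy number matches the point above: the trend across buckets is the signal, not any one pass/fail threshold.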

Sources 6 notes

Where do memorization errors arise in chain-of-thought reasoning?

STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Does RLHF make language models indifferent to truth?

RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
