Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.

Note · 2026-02-22 · sourced from Training Fine Tuning
How do you build domain expertise into general AI models? · How should we allocate compute budget at inference time? · How should researchers navigate LLM reasoning research?

Proxy-tuning fine-tunes a small model, then applies the difference between the small tuned and untuned models' predictions to shift a large untuned model's outputs at decoding time. The large model's parameters are never modified. The method closes 91% of the performance gap between Llama-2-13B and its directly tuned chat version, and 88% for the 70B model.
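The decoding-time arithmetic can be sketched in a few lines (a toy illustration with made-up logits and a 4-token vocabulary, not the paper's implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits):
    """One proxy-tuning decoding step: shift the large base model's
    logits by the (tuned expert - untuned anti-expert) delta, then
    renormalize. The base model's weights are never touched."""
    shifted = base_logits + (expert_logits - antiexpert_logits)
    return softmax(shifted)

# Illustrative numbers, not real model outputs.
base       = np.array([2.0, 1.0, 0.5, 0.1])   # large untuned model
expert     = np.array([0.5, 2.5, 0.2, 0.0])   # small tuned model
antiexpert = np.array([1.5, 0.5, 0.4, 0.2])   # small untuned model

probs = proxy_tuned_probs(base, expert, antiexpert)
# The delta (expert - antiexpert) favors token 1, so the shifted
# distribution promotes it over the base model's top choice (token 0).
print(probs.argmax())  # → 1
```

The key property is visible here: the delta carries only what tuning *changed* in the small model, and that change is transplanted onto the large model's richer distribution at each step.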

The critical finding: on knowledge-intensive tasks, proxy-tuning sometimes surpasses direct instruction-tuning. This is because direct fine-tuning modifies model weights, and some of those modifications overwrite pretrained knowledge. As Why does reasoning training help math but hurt medical tasks? shows, weight modification risks corrupting the knowledge storage that proxy-tuning leaves intact.

Proxy-tuning primarily promotes reasoning and stylistic tokens. Analysis of the token-level distributional shift shows the largest influence on tokens associated with reasoning patterns and output style — consistent with evidence that "alignment mainly affects style rather than knowledge." This aligns with Does instruction tuning teach task understanding or output format? and Can imitating ChatGPT fool evaluators into thinking models improved?: what fine-tuning actually changes is output distribution, not capability. Proxy-tuning achieves this distributional change without touching the model weights that encode knowledge.

For domain adaptation, proxy-tuning Llama-2-13B using CodeLlama-7B produces a 17-32% improvement on coding benchmarks. The small expert provides the distributional guidance; the large base model provides the knowledge. An optional hyperparameter controls the amount of guidance, enabling runtime trade-offs between different generation attributes.
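A minimal sketch of that guidance knob, assuming a scalar `alpha` that scales the expert delta (the parameter name and toy logits are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def guided_probs(base_logits, expert_logits, antiexpert_logits, alpha=1.0):
    """Scale the expert delta by alpha: alpha=0 recovers the untuned
    base model, alpha=1 applies full proxy-tuning, and intermediate
    values trade off guidance strength at runtime with no retraining."""
    return softmax(base_logits + alpha * (expert_logits - antiexpert_logits))

base       = np.array([2.0, 1.0, 0.5, 0.1])
expert     = np.array([0.5, 2.5, 0.2, 0.0])
antiexpert = np.array([1.5, 0.5, 0.4, 0.2])

print(guided_probs(base, expert, antiexpert, alpha=0.0).argmax())  # → 0 (base wins)
print(guided_probs(base, expert, antiexpert, alpha=1.0).argmax())  # → 1 (delta wins)
```

Because `alpha` is applied per decoding step, it can be changed between requests, or even mid-generation, without touching any weights.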

This constitutes a fifth paradigm in How do knowledge injection methods trade off flexibility and cost?: decoding-time adaptation. It has zero training cost on the target model and full knowledge preservation, but requires access to the base model's logits at inference time.

ARGS (Alignment as Reward-Guided Search) provides a complementary inference-time method. Instead of applying a distributional shift from a tuned proxy, ARGS adjusts model predictions at each decoding step using a reward signal directly. Two components: reward-guided scoring (assigns scores to possible continuations) and token selection (selects a continuation based on scored candidates). A tunable weight controls the trade-off between semantic relevance and alignment criteria — setting it to zero recovers standard maximum-likelihood decoding.

ARGS enables rapid personalized alignment without retraining: different users can have different reward functions applied at inference time. Together, proxy-tuning (distributional shift from expert delta) and ARGS (reward-guided decoding) suggest a design space where multiple axes of adaptation — domain knowledge, user preferences, task constraints — can each be applied at decoding time through complementary mechanisms. See Can user preferences be learned from just ten questions? for how per-user reward functions can be efficiently constructed.
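One ARGS decoding step can be sketched as follows (a toy example: `reward_fn` is a stand-in for a learned reward model, and the numbers are illustrative, not the authors' code):

```python
import numpy as np

def args_step(logprobs, reward_fn, context, w=1.0, k=3):
    """Score each of the top-k candidate tokens by model log-probability
    plus w * reward of the extended continuation, then pick the best.
    With w=0 this reduces to standard maximum-likelihood decoding."""
    topk = np.argsort(logprobs)[::-1][:k]
    scores = {int(t): logprobs[t] + w * reward_fn(context + [int(t)])
              for t in topk}
    return max(scores, key=scores.get)

# Token 0 is most likely under the model, but a reward model that
# prefers token 2 overrides it once w is large enough.
logprobs = np.log(np.array([0.5, 0.3, 0.15, 0.05]))
reward = lambda seq: 2.0 if seq[-1] == 2 else 0.0

print(args_step(logprobs, reward, [], w=0.0))  # → 0: pure likelihood
print(args_step(logprobs, reward, [], w=1.0))  # → 2: reward-guided
```

Swapping `reward_fn` per user is what makes personalized alignment cheap here: the model and its weights are shared, and only the small scoring function differs.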




Proxy-tuning at decoding time preserves pretrained knowledge better than direct fine-tuning by applying the tuning signal as a distributional shift.