INQUIRING LINE

How do inference-time reward methods compare to per-user fine-tuning?

This explores the trade-off between steering a model's behavior at generation time (using a reward signal to nudge outputs without touching weights) versus actually retraining the weights to fit a particular user.


There are two places to inject a user's preferences: into the model's outputs at the moment it generates (inference-time reward methods), or into the model's weights themselves (per-user fine-tuning). The most direct answer in the collection is PReF, which personalizes purely at inference time: it learns a set of base reward functions once, then infers a specific user's preferences as a lightweight combination of those, needing as few as ten adaptively chosen questions to pin that combination down. No weights are modified per user (see "Can user preferences be learned from just ten questions?"). The appeal is obvious: you skip a training run for every individual, and personalization becomes a cheap, reversible dial rather than a permanent change.
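A minimal sketch of that idea, assuming the user's reward is a weighted sum of base reward scores and that the weights are fit from pairwise answers with a logistic (Bradley-Terry) model. The function names, the toy score vectors, and the plain gradient ascent are illustrative assumptions, not PReF's actual implementation, and the adaptive choice of questions is left out.

```python
import math

def user_reward(weights, base_scores):
    """Score one response as a weighted sum of base reward scores."""
    return sum(w * s for w, s in zip(weights, base_scores))

def fit_user_weights(comparisons, dim, lr=0.5, steps=200):
    """Fit per-user weights from (preferred_scores, rejected_scores) pairs.

    Assumes P(preferred beats rejected) = sigmoid(w . (preferred - rejected)).
    Uses plain gradient ascent on the log-likelihood. Only this small weight
    vector is learned; the language model's own weights are never touched.
    """
    weights = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for preferred, rejected in comparisons:
            diff = [p - r for p, r in zip(preferred, rejected)]
            margin = sum(w * d for w, d in zip(weights, diff))
            prob = 1.0 / (1.0 + math.exp(-margin))
            for i, d in enumerate(diff):
                grad[i] += (1.0 - prob) * d
        weights = [w + lr * g / len(comparisons) for w, g in zip(weights, grad)]
    return weights

def pick_response(weights, candidates):
    """Best-of-n: return the candidate text with the highest user reward."""
    return max(candidates, key=lambda c: user_reward(weights, c[1]))[0]

# Hypothetical base rewards: [conciseness, formality].
# Each pair records the base scores of the answer the user preferred,
# followed by the base scores of the answer the user rejected.
comparisons = [
    ([0.9, 0.2], [0.1, 0.8]),
    ([0.8, 0.5], [0.3, 0.5]),
    ([0.7, 0.1], [0.2, 0.9]),
]
weights = fit_user_weights(comparisons, dim=2)

candidates = [("short answer", [0.9, 0.3]), ("long formal answer", [0.2, 0.9])]
choice = pick_response(weights, candidates)
```

Because this user consistently preferred the more concise answer, the fitted weight on conciseness comes out positive and the weight on formality negative, so reranking picks the short candidate. Undoing the personalization means discarding the weight vector.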

What does fine-tuning buy you that the inference-time route can't? The collection suggests the answer is depth of internalization. When reinforcement learning rewrites weights, it doesn't scatter changes everywhere; it reliably edits a sparse, structured subnetwork (5–30% of parameters), nearly the same one across random seeds. That hints that training durably restructures how the model reasons rather than just biasing its surface outputs (see "Does reinforcement learning update only a small fraction of parameters?"). There's a sharper version of this gap: reasoning models keep beating non-reasoning ones no matter how much inference-time compute you throw at the weaker model, because the training regime instilled a protocol that makes extra tokens productive. You can't always buy at inference time what was installed during training (see "Can non-reasoning models catch up with more compute?").
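One way to make the sparsity claim concrete is to count how many parameters differ between a checkpoint taken before RL and one taken after, and to compare which parameters moved across seeds. The sketch below uses flat Python lists as stand-ins for weight tensors; the helper names, the tolerance parameter, and the toy numbers are all hypothetical.

```python
def changed_indices(before, after, tol=0.0):
    """Indices of parameters whose value moved by more than `tol`."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    return {i for i, (b, a) in enumerate(zip(before, after)) if abs(a - b) > tol}

def update_fraction(before, after, tol=0.0):
    """Fraction of parameters that the training run actually changed."""
    return len(changed_indices(before, after, tol)) / len(before)

def overlap(set_a, set_b):
    """Jaccard overlap between two sets of updated parameter indices."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

# Toy "checkpoints": two runs with different seeds move the same two entries.
base = [0.10, -0.20, 0.30, 0.40, -0.50, 0.60, 0.70, -0.80, 0.90, 1.00]
seed_1 = list(base)
seed_1[2], seed_1[7] = 0.35, -0.75
seed_2 = list(base)
seed_2[2], seed_2[7] = 0.33, -0.79

fraction = update_fraction(base, seed_1)
shared = overlap(changed_indices(base, seed_1), changed_indices(base, seed_2))
```

In this toy case two of ten parameters change, and both seeds touch the same two, which is the pattern the note describes at a much larger scale.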

But the inference-time camp has been quietly getting more powerful, which narrows that gap. Reward models themselves can now reason before they score: adding chain-of-thought to evaluation raises their ceiling and lets them scale with test-time compute, so the 'judge' guiding generation is no longer a fixed function (see "Can reward models benefit from reasoning before scoring?"). And that compute can be spent intelligently: giving hard prompts more of the inference budget and easy prompts less beats spreading the same budget uniformly, meaning inference-time steering can be tuned per-case rather than per-user (see "Can we allocate inference compute based on prompt difficulty?").
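The budget-allocation idea can be sketched as follows, assuming each prompt already has a difficulty estimate in [0, 1]. How that estimate is produced is not covered by the notes, so the scores, the function name, and the proportional rule here are assumptions made for illustration.

```python
def allocate_samples(difficulties, total_budget, min_samples=1):
    """Split a fixed sample budget across prompts in proportion to difficulty.

    Every prompt gets at least `min_samples`. The remaining budget is shared
    in proportion to difficulty, and any samples lost to rounding down are
    handed to the hardest prompts first.
    """
    n = len(difficulties)
    if total_budget < n * min_samples:
        raise ValueError("budget too small for the minimum per prompt")
    spare = total_budget - n * min_samples
    total_difficulty = sum(difficulties)
    if total_difficulty == 0:
        shares = [spare / n] * n
    else:
        shares = [spare * d / total_difficulty for d in difficulties]
    allocation = [min_samples + int(s) for s in shares]
    leftover = total_budget - sum(allocation)
    hardest_first = sorted(range(n), key=lambda i: difficulties[i], reverse=True)
    for i in hardest_first[:leftover]:
        allocation[i] += 1
    return allocation

# Hypothetical difficulty scores for an easy, a medium, and a hard prompt.
difficulties = [0.1, 0.5, 0.9]
adaptive = allocate_samples(difficulties, total_budget=12)
uniform = [12 // len(difficulties)] * len(difficulties)
```

Both schemes spend twelve samples in total. The adaptive one gives the easy prompt a single sample and the hard prompt seven, where the uniform one gives each prompt four.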

The interesting twist is that the two approaches aren't a clean either/or; the boundary blurs. Test-Time RL uses majority-vote agreement across samples as its own reward signal at deployment, then trains on it, turning what starts as inference-time compute into actual weight updates with no labels at all (see "Can models improve themselves using only majority voting?"). So the real design question isn't 'reward at inference or fine-tune,' it's where on the spectrum you commit a preference. Inference-time reward methods are cheap, reversible, and ideal when users are many and preferences shift, which is exactly the per-user case PReF targets. Fine-tuning is the right tool when a behavior needs to become load-bearing and permanent.
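The majority-vote reward can be sketched in a few lines. This covers only the reward-construction step, assuming final answers have already been extracted from each sample as strings; the function name is hypothetical, and the RL update that would consume these rewards is not shown.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Return (consensus_answer, rewards) for one prompt's sampled answers.

    Each sample earns 1.0 if it matches the most common answer and 0.0
    otherwise. No ground-truth label or trained reward model is consulted.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    consensus, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == consensus else 0.0 for a in answers]
    return consensus, rewards

# Five sampled answers to the same prompt.
samples = ["42", "41", "42", "42", "17"]
consensus, rewards = majority_vote_rewards(samples)
```

Here the consensus is "42", so the three matching samples are rewarded and the two outliers are not. The signal is only as good as the consensus: when the majority answer is wrong, the loop reinforces the error.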

The thing worth carrying away: the reward signal is the shared currency between both worlds. Whether it nudges a single generation or drives a gradient step, the quality of that signal dominates, and we know it has sharp failure modes, like binary correctness rewards that degrade a model's calibration into overconfident guessing (see "Does binary reward training hurt model calibration?"). A well-designed reward steers well at inference and trains well at fine-tune time; a badly designed one corrupts both. The choice of where to apply it is almost secondary to getting the signal right.
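To see why a binary reward invites overconfident guessing, compare it with a reward that also scores the model's stated confidence. The sketch below assumes the combined reward is correctness minus the squared error between confidence and outcome (a Brier penalty); the exact form and weighting used in the cited note are not given here, so treat this formula and the function names as assumptions.

```python
def binary_reward(correct):
    """Reward that ignores confidence entirely."""
    return 1.0 if correct else 0.0

def calibrated_reward(correct, confidence):
    """Correctness reward minus a Brier penalty on the stated confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2

# Two wrong answers: one stated with 95% confidence, one with 30%.
confident_wrong = calibrated_reward(False, 0.95)
hedged_wrong = calibrated_reward(False, 0.30)
```

Under the binary reward both wrong answers score 0.0, so nothing discourages claiming certainty. Under the combined reward the confident wrong answer is penalized far more heavily than the hedged one, while a correct answer stated with full confidence still earns the maximum of 1.0.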


Sources (7 notes)

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can we allocate inference compute based on prompt difficulty?

Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.

Can models improve themselves using only majority voting?

Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
