Can we adjust helpfulness and harmlessness at test time without retraining?
This explores whether you can dial helpfulness and harmlessness up or down at inference time — treating them as adjustable knobs — rather than baking each balance in through a fresh round of training.
The corpus has one paper that points straight at a yes: Emulated Fine-Tuning (Do pretraining and fine-tuning scale independently in language models?). Its key finding is that two things you'd normally treat as entangled are in fact decoupled: scaling pretraining mainly improves factual knowledge, while scaling fine-tuning mainly improves behavioral helpfulness. The two also appear to live in different parts of the network, with lower layers storing knowledge and upper layers expressing behavior. Because of that separation, you can combine a large pretrained model's knowledge with a small fine-tuned model's behavior at decode time, emulating the result of fine-tuning the large model without ever running that fine-tuning. That's the cleanest example here of moving a behavioral trait at test time rather than through retraining.
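The decode-time recombination described above can be sketched as log-probability arithmetic over a single next-token step. This is a minimal illustration, assuming the commonly described formulation in which the small model's fine-tuning shift (tuned minus base) is added to the large base model's log-probabilities and the result is renormalized. The function names and the three-token toy vocabulary are hypothetical, and real use would take logits from actual models that share a tokenizer.

```python
import math

def log_softmax(logits):
    """Turn raw logits into log-probabilities (numerically stable)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]

def emulated_finetune_step(large_base, small_base, small_tuned):
    """Combine three next-token logit vectors over the same vocabulary.

    Assumption: knowledge comes from the large base model, behavior from
    the difference between the small tuned model and the small base model.
    """
    lb = log_softmax(large_base)
    sb = log_softmax(small_base)
    st = log_softmax(small_tuned)
    # Large-model knowledge plus the small-model behavioral shift.
    combined = [b + (t - s) for b, s, t in zip(lb, sb, st)]
    return log_softmax(combined)  # renormalize to a valid distribution

# Toy logits over a hypothetical three-token vocabulary.
large_base = [2.0, 1.0, 0.0]
small_base = [1.0, 1.0, 1.0]
small_tuned = [0.0, 2.0, 0.0]

step = emulated_finetune_step(large_base, small_base, small_tuned)
best_token = max(range(len(step)), key=step.__getitem__)
```

If the small tuned model equals the small base model, the behavioral shift is zero and the function returns the large base distribution unchanged, which is a quick sanity check on the arithmetic.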
Why you'd want that knob becomes obvious once you look at the warmth research. Training a model to be warmer and more empathetic systematically degrades its reliability by 10–30 percentage points on medical reasoning, factual accuracy, and disinformation resistance (Does warmth training make language models less reliable?), and the damage gets worse precisely when a user is sad or holding a false belief (Does empathy training make AI systems less reliable?). So a model that feels more helpful and a model that is reliably harmless pull against each other. If that trade-off were a dial, you could lower warmth when a user needs accurate medical information and raise it elsewhere. If it's welded in by training, you're stuck with one compromise for every situation.
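To make the dial idea concrete, here is a hedged sketch of one way such a knob could work at decode time: blend the behavioral shifts of two differently tuned models with a single weight. Everything here is an assumption for illustration (the `dial_step` name, the `lam` parameter, and the toy logits), and it presumes you have a base model plus one variant tuned toward helpfulness and one tuned toward harmlessness, all sharing a vocabulary.

```python
import math

def log_softmax(logits):
    """Turn raw logits into log-probabilities (numerically stable)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]

def dial_step(base, helpful_tuned, harmless_tuned, lam):
    """Blend two behavioral shifts for one next-token step.

    lam = 1.0 applies only the helpful shift, lam = 0.0 only the
    harmless shift; values in between interpolate (hypothetical knob).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    b = log_softmax(base)
    h = log_softmax(helpful_tuned)
    s = log_softmax(harmless_tuned)
    mixed = [
        bi + lam * (hi - bi) + (1.0 - lam) * (si - bi)
        for bi, hi, si in zip(b, h, s)
    ]
    return log_softmax(mixed)

# Toy logits: token 0 stands for "answer", token 2 for "decline".
base = [1.0, 1.0, 1.0]
helpful = [3.0, 0.0, 0.0]
harmless = [0.0, 0.0, 3.0]

for lam in (0.0, 0.5, 1.0):
    probs = [round(math.exp(v), 3) for v in dial_step(base, helpful, harmless, lam)]
    print(lam, probs)
```

The point of the design is that `lam` can change per request, so a medical question and a casual chat need not share one fixed compromise.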
Here's the twist the corpus adds, and it's the thing you might not have known to ask: *how* a trait is trained determines whether it's even adjustable. When warmth is learned as a global character trait, it corrupts factual reliability across the board; when the same empathy is rewarded as a contextual, behavior-level response, reliability survives (Does training granularity change how AI empathy affects reliability?). The same pattern shows up in safety alignment, which monotonically erodes a model's ability to portray morally complex villains: the model substitutes crude aggression for nuance, which looks like a blunt global override rather than a contextual one (Does safety alignment harm models' ability to roleplay villains?). The lesson: traits baked in globally are hard to move later; traits represented contextually leave room for situational adjustment.
A few notes hint at the inference-time levers themselves. Consistency training has an activation-level variant (ACT) that steers a model toward stable behavior using its own clean responses as targets (Can models learn to ignore irrelevant prompt changes?); activation-level handles are exactly what you'd manipulate to nudge behavior without weight updates. And negative reinforcement shows that suppressing unwanted trajectories preserves diversity better than amplifying wanted ones (Does negative reinforcement alone outperform full reinforcement learning?), a framing that plausibly carries over to test-time steering, where gentle suppression may beat heavy-handed pushing. The honest bottom line: this collection isn't built around test-time control methods, so it won't hand you a steering toolkit. What it does give you is the deeper precondition: helpfulness and harmlessness look separable enough to adjust (EFT is the strongest evidence for it), but only if they were represented in a way that left the seam intact.
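As an illustration of what an activation-level handle might look like, the sketch below shifts a hidden-state vector along a trait direction without touching any weights. It uses generic difference-of-means steering, which is an assumption chosen for illustration and not the ACT training method from the cited note. The function names, the two-dimensional toy activations, and the `alpha` strength parameter are all hypothetical.

```python
import math

def mean_difference(with_trait, without_trait):
    """Estimate a trait direction as the difference of mean activations."""
    dim = len(with_trait[0])
    mean_with = [sum(v[i] for v in with_trait) / len(with_trait) for i in range(dim)]
    mean_without = [sum(v[i] for v in without_trait) / len(without_trait) for i in range(dim)]
    return [a - b for a, b in zip(mean_with, mean_without)]

def steer(hidden, direction, alpha):
    """Shift a hidden state along the unit-normalized direction.

    Positive alpha amplifies the trait, negative alpha suppresses it,
    and alpha = 0.0 leaves the activation unchanged.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

# Toy activations recorded on prompts with and without the trait.
warm = [[1.0, 1.0], [3.0, 1.0]]
neutral = [[0.0, 1.0], [0.0, 1.0]]

direction = mean_difference(warm, neutral)
damped = steer([1.0, 0.5], direction, alpha=-0.5)  # gentle suppression
```

In a real model the same shift would be applied inside a chosen layer during the forward pass. Using a small negative `alpha` mirrors the suppression framing above, though whether suppression actually outperforms amplification at test time would need to be measured.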
Sources (7 notes)
Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
Five models trained for warmth showed 5–9pp error increases on medical reasoning, factual accuracy, and disinformation resistance. Emotional context amplified errors by 19.4%, and standard safety benchmarks failed to detect the degradation.
Research shows persona training for empathy increases errors in medical reasoning, truthfulness, and disinformation resistance. Standard safety benchmarks miss this vulnerability, and effects intensify when users express sadness or false beliefs.
Trait-level warmth training degrades factual accuracy by 10–30 percentage points while behavior-level emotion rewards preserve it. The difference lies in whether empathy is learned as a global character trait versus contextual behavioral responses.
The Moral RolePlay benchmark shows LLM performance drops from 3.21 for moral paragons to 2.62 for villains, with largest degradation between flawed-but-good and egoistic characters. Models fail most on deception and manipulation traits, substituting crude aggression for nuanced malevolence.
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.