Can explicit numerical signals override learned linguistic defaults in fine-tuned models?
This explores whether feeding a model hard numbers — explicit constraints, probabilities, verifiable signals — can actually steer its behavior, or whether the model just falls back on the linguistic and semantic habits it absorbed during training.
The corpus leans toward a sobering answer: by default, the linguistic priors usually win, unless the numerical signal is wired in at the level of training or representation rather than just stated in the prompt. The most direct evidence is that prompting alone rarely beats a strong prior. When a model has absorbed a strong association during training, in-context information that contradicts it tends to be ignored; textual instruction alone does not override it, and only intervening in the model's internal representations reliably shifts the output (see "Why do language models ignore information in their context?"). A number you place in the prompt is just another piece of context, and context is exactly what loses this fight.
The reason runs deeper than stubbornness: it concerns what these models are actually doing when they appear to process a number. When you strip the semantic content from a reasoning task and leave only the formal structure (the kind of thing a numerical or symbolic signal carries), performance collapses, even with the correct rules sitting right there in context. Models lean on commonsense token associations rather than manipulating symbols, so their reasoning stays confined to the semantics of their training distribution (see "Do large language models reason symbolically or semantically?"). An explicit number is a symbolic signal asking for symbolic handling, and that is precisely the mode these models are weakest in.
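To make "stripping the semantic content away" concrete, here is a minimal sketch that rewrites a rule so its logical form survives but its meaningful words do not. The function name `decouple` and the `X1, X2, ...` placeholder scheme are assumptions made for illustration; this is not the procedure used in the cited work.

```python
import re

def decouple(text: str, terms: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each meaningful term with an abstract placeholder.

    Hypothetical scheme: the i-th term becomes "X<i>", so the rule keeps
    its formal structure while losing its commonsense associations.
    """
    mapping = {term: f"X{i}" for i, term in enumerate(terms, start=1)}
    # Try longer terms first so a short term never matches inside a longer one.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping

rule = "If something is a bird then it can fly. A sparrow is a bird."
abstract, mapping = decouple(rule, ["bird", "fly", "sparrow"])
print(abstract)  # If something is a X1 then it can X2. A X3 is a X1.
```

Both versions license the same inference, so a gap in accuracy between them would reflect reliance on what the words mean rather than on the rule itself.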
There is a subtler version of the failure worth knowing about: a model can look like it is responding to an explicit constraint when it is really just defaulting. When numerical constraints were removed from problems, twelve of fourteen models got *worse*, by up to 38.5 percentage points: they had been exploiting a conservative bias (defaulting to the harder option) rather than genuinely evaluating the constraint (see "Are models actually reasoning about constraints or just defaulting conservatively?"). So even apparent success at honoring explicit signals can be a linguistic default in disguise. RLHF compounds the problem along a different dimension: it can push a model toward expressing whatever reads well rather than what its internal state represents as true, so a stated signal competes against a learned tendency to perform (see "Does RLHF make language models indifferent to truth?").
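The constraint-removal check described above can be sketched as a simple ablation: score the same model on problems with and without the constraint and compare. The function names and the toy data are invented for illustration (a model that always picks the harder option); none of the numbers come from the cited study.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal their labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def ablation_gap(preds_with, labels_with, preds_without, labels_without):
    """Accuracy with constraints minus accuracy without, in percentage points."""
    return 100 * (accuracy(preds_with, labels_with)
                  - accuracy(preds_without, labels_without))

# Toy data (assumption): the model answers "hard" regardless of the input.
always_hard = ["hard"] * 4
labels_with = ["hard", "hard", "hard", "easy"]      # constraint usually forces "hard"
labels_without = ["easy", "easy", "easy", "hard"]   # constraint removed

gap = ablation_gap(always_hard, labels_with, always_hard, labels_without)
print(gap)  # 50.0
```

A large positive gap from a model whose answers never change is the signature of a conservative default: it looked right with constraints only because the default happened to coincide with the correct answer.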
But the corpus also points to where numerical signals *do* break through, and it is not in the prompt but in the training loop. A single verifiable example in RLVR can lift math accuracy from 36% to 73.6%, with test accuracy continuing to improve for 1,400 steps after training accuracy reaches 100% (see "Can a single training example unlock mathematical reasoning?"). An explicit, checkable signal there does not fight the linguistic default; it activates latent capability the model already had. Similarly, uncertainty-aware training lets small models learn to abstain and match pre-trained models ten times larger on conversation forecasting, suggesting that calibration ability exists in these models but is undertrained by default (see "Can models learn to abstain when uncertain about predictions?"). Binary environmental feedback, a pure success/failure signal, lets agents self-correct across episodes without touching their weights, because the unambiguous signal blocks the rationalization that linguistic flexibility allows (see "Can agents learn from failure without updating their weights?").
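What makes a training signal "verifiable" is that a program, not a judgment call, decides the reward. The sketch below shows one such binary checker for math answers. The convention that the final answer is the last number in the completion, and the names `extract_final_answer` and `verifiable_reward`, are assumptions for illustration and not the reward function from the cited RLVR work.

```python
import re
from fractions import Fraction

def extract_final_answer(completion: str):
    """Return the last number in the text, or None if there is none.

    Assumed convention: the model's final answer is the last integer,
    decimal, or simple fraction it writes.
    """
    matches = re.findall(r"-?\d+(?:/\d+|\.\d+)?", completion)
    return matches[-1] if matches else None

def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the final answer equals the reference exactly, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        # Fraction compares exactly, so "3/4" and "0.75" count as the same answer.
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0

print(verifiable_reward("2 + 2 = 4, so the answer is 3/4", "0.75"))  # 1.0
print(verifiable_reward("I think it is about one", "0.75"))          # 0.0
```

The reward ignores how fluent or plausible the completion reads, which is the property the paragraph above relies on: there is no partial credit for text that merely sounds right.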
The pattern that emerges is that numbers are not weak; *where you inject the signal* decides everything. A number in the prompt competes against linguistic priors and tends to lose. The same number used as a training reward, a verifiable check, or a representational intervention does not compete at all: it reaches under the linguistic surface and reshapes what the defaults are.
Sources (7 notes)
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.
A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.
Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.