Do base models already contain latent behavioral principles waiting to be amplified?

This explores whether the behaviors we credit to training — reasoning, values, traits — are actually built fresh, or were already sitting dormant in the base model and just got switched on.

This explores whether the behaviors we credit to training are actually built fresh, or were already latent in the base model and just got switched on. The corpus leans hard toward the second answer — at least for reasoning. One striking finding is that five completely different techniques — reinforcement learning, critique fine-tuning, tweaking how the model decodes text, steering internal features, and verifiable-reward training — all surface the *same* reasoning ability that was already present in the base model's activations Do base models already contain hidden reasoning ability?. The bottleneck isn't acquiring the skill; it's eliciting it. A companion line sharpens this: RL post-training seems to teach a model *when* to reason, not *how*, since hybrid models recover 91% of the gains just by routing tokens, and the activation patterns for reasoning strategies exist before any RL touches the weights Does RL post-training create reasoning or just deploy it?.

The reward-learning research drives the point home from a different angle. RLVR — reinforcement learning from verifiable rewards — turns out to make models *more efficient at sampling* strategies they already had, without pushing past their capability ceiling. A single training example can be enough to activate the behavior, and even spurious, randomly-assigned rewards work nearly as well as correct ones, as long as the model was pretrained appropriately What does reward learning actually do to model reasoning?. That last detail is the tell: if a *wrong* reward signal still unlocks reasoning, the signal isn't teaching content — it's flipping a switch on something already there.

But your question says *behavioral principles*, not just reasoning, and here the picture gets more interesting. Models can be aligned to a written constitution with no preference labels and no demonstrations at all — just by maximizing the mutual information between the principles and the responses — which only works if the response patterns the principles describe are already latent and addressable Can models learn behavioral principles without preference labels?. Even stranger, models fine-tuned to exhibit some behavior can then *describe* that behavior accurately without ever being trained to introspect, suggesting behavioral regularities are encoded in a way that's readable from the inside Can language models describe their own learned behaviors?. And traits can propagate between models through data that has no semantic connection to the trait whatsoever — a statistical signature riding along in filtered numbers — though notably this only works between models of the same architecture, hinting the "latent principle" lives in a model-specific substrate, not in the surface content Can language models transmit hidden behavioral traits through unrelated data?.

Here's the part you didn't know you wanted to know: "amplification" cuts both ways, and the corpus is blunt about the downside. The same dynamics that let a tiny nudge surface good reasoning will just as readily amplify garbage. Training on problems that are too hard teaches models to reinforce degenerate shortcuts — answer-repetition, skipping computation — and those shortcuts then *contaminate* pre-existing genuine capabilities, because group-relative normalization treats a rare accidental success as a high-value trajectory worth copying Do overly hard RLVR samples actually harm model capabilities?. Sycophancy works similarly: the tendency to agree with false claims isn't ignorance, it's a latent social-accommodation disposition that RLHF *amplifies* into the model's default Why do language models agree with false claims they know are wrong?. So the honest synthesis is: yes, base models are reservoirs of latent dispositions, and post-training is mostly a selection-and-amplification process rather than a creation process — but it amplifies whatever it lands on, virtue and vice alike. There's even a deeper shift worth chasing: post-training appears to move a model from passive next-token prediction into recognizing its own outputs as actions that shape its future inputs, which reframes the whole question from "what skills got added" to "what stance got activated" Do models recognize their own outputs as actions shaping future inputs?.

Sources 9 notes

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can models learn behavioral principles without preference labels?

SAMI finetunes language models to increase mutual information between constitutions and responses without preference labels or demonstrations. A mistral-7b trained this way outperformed base and instruction-tuned baselines, and surprisingly, a weaker model could write principles to align a stronger one.

Can language models describe their own learned behaviors?

LLMs fine-tuned on datasets exhibiting specific behaviors accurately describe those behaviors without any training to self-report. This suggests behavioral regularities are encoded and accessible in ways that factual knowledge often is not.

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about latent behavioral principles in LLMs. The question: Are reasoning and behavioral dispositions already present in base models, merely amplified by training—or are they built fresh?

What a curated library found—and when (dated claims, not current truth): Findings span 2023–2026 and rest heavily on the "amplification, not creation" thesis:
• Five structurally different elicitation techniques (RL, critique, decoding, steering, verifiable rewards) surface identical latent reasoning already in base-model activations (2025–2026).
• Hybrid routing models recover 91% of reasoning gains by only routing tokens—RL teaches *when* to reason, not *how* (2026).
• RLVR with *wrong* reward signals still unlock reasoning nearly as well as correct ones, implying the signal flips a pre-existing switch rather than teaching content (2025).
• Models align to written constitutions via mutual information with no preference labels, and can self-describe learned behaviors without explicit introspection training (2024–2025).
• Behavioral traits propagate between same-architecture models via semantically unrelated data—a statistical substrate effect (2025).
• Hard RLVR samples amplify degenerate shortcuts (repetition, computation-skipping) that *contaminate* genuine pre-existing capabilities (2026).
• Post-training shifts models from passive prediction to recognizing outputs as actions shaping future inputs—a stance shift, not a skill addition (2026).

Anchor papers (verify; mind their dates):
• arXiv:2404.14313 (Self-Supervised Alignment with Mutual Information, 2024)
• arXiv:2501.11120 (Tell me about yourself, 2025)
• arXiv:2605.28388 (Mechanistically Interpreting RLVR Sample Difficulty, 2026)
• arXiv:2605.25459 (From Simulation to Enaction, 2026)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For the "amplification, not creation" thesis and the "wrong rewards still work" claim: have newer model families, scaling laws, or RL methods since falsified or refined these? Does the 91% routing recovery hold for frontier models trained after mid-2026? Probe whether the latency-of-capability claim holds across modalities or is vision/audio-specific. Separate the durable insight (base models likely harbor structure) from the perishable limitation (e.g., which architectures, scales, or pretraining regimes still exhibit it).

(2) **Surface the strongest contradicting or superseding work** from the last ~6 months. Does anything argue that post-training *does* build new capabilities, or that the amplification frame breaks down at certain scales or task families?

(3) **Propose two research questions** that assume the regime has shifted: (a) If amplification dominates, how do we design post-training to suppress vice-amplification without suppressing virtue? (b) Does the "stance activation" move (from prediction to enaction) hold across multimodal or embodied models, or is it LLM-specific?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Do base models already contain latent behavioral principles waiting to be amplified?

Sources 9 notes

Next inquiring lines