Does correct model behavior guarantee internal alignment of learned objectives?

This explores the gap between what a model *does* (correct outputs) and what it has *learned* (the internal objective or procedure behind those outputs) — and whether passing the behavioral test means the inside is actually aligned.

Is correct behavior proof of correct learning? The corpus answers, repeatedly and from different angles, no. The most direct evidence is that instruction tuning can produce comparable task performance even when the instructions are semantically empty or deliberately wrong: models trained on such instructions reach 43% against a 42.6% random baseline, suggesting that what actually transfers is knowledge of the output space, not understanding of the task (see "Does instruction tuning teach task understanding or output format?"). The right-looking answer is coming from the wrong place. The same dissociation shows up in reasoning: RL-fine-tuned models that look strong on in-distribution problems collapse on N-1 out-of-distribution variants, revealing that training sharpened template-matching and memorization rather than installing a genuine problem-solving procedure (see "Do fine-tuned language models actually learn optimization procedures?").

What's quietly unsettling is that the failure isn't always visible in accuracy at all. Binary correctness rewards produce models that are right *and* dangerously overconfident, because nothing penalizes a confident wrong answer: the behavior optimizes while the internal calibration silently degrades (see "Does binary reward training hurt model calibration?"). So even when you do see correct outputs, the learned objective underneath can be subtly misaligned with what you wanted (a well-calibrated belief, not just a lucky guess). This is the recurring shape: surface success, divergent interior.
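To make the calibration point concrete, here is a minimal Python sketch comparing an outcome-only reward with one that also scores stated confidence. The source note on calibration mentions adding the Brier score as a second reward term; the specific combination used here (correctness minus the squared gap between confidence and outcome) and all function names are assumptions for illustration, not the note's exact formulation.

```python
def binary_reward(correct: bool, confidence: float) -> float:
    """Outcome-only reward: the stated confidence is ignored entirely."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Assumed form: correctness minus a Brier penalty (confidence - outcome)^2."""
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_reward(reward_fn, p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * reward_fn(True, confidence)
            + (1.0 - p_correct) * reward_fn(False, confidence))


def best_confidence(reward_fn, p_correct: float, grid: int = 101) -> float:
    """Confidence on a uniform grid over [0, 1] that maximizes expected reward."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return max(candidates, key=lambda q: expected_reward(reward_fn, p_correct, q))


if __name__ == "__main__":
    p = 0.6  # hypothetical true chance that the answer is correct
    for q in (0.1, 0.6, 0.99):
        print(q,
              expected_reward(binary_reward, p, q),
              expected_reward(calibrated_reward, p, q))
    print("best confidence under calibrated reward:",
          best_confidence(calibrated_reward, p))
```

Under the binary reward the expected value does not depend on the stated confidence at all, so nothing discourages claiming near-certainty. Under the assumed combined reward the expected value peaks when the stated confidence equals the true chance of being right, which is the behavior the paragraph above says outcome-only training fails to enforce.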

Lateral to all this is the deeper question of whether internal alignment is even *reachable* by output supervision. One thread argues that symbolic goal-encoding without world contact and social mediation cannot guarantee correspondence between stated goals and real outcomes; alignment requires indexical grounding, not just the right tokens (see "Can AI systems achieve real alignment without world contact?"). Another argues self-improvement is bounded by a generation-verification gap: a model can't reliably certify its own internal objectives, so verification must be externalized rather than assumed from good behavior (see "What actually constrains large language models from self-improvement?"). Both say the same thing in different vocabularies: you can't read internal alignment off the output, because the output underdetermines what was learned.

The corpus also hints at what *narrows* the gap, which tells you where the gap lives. Methods that supply explicit negative examples or step-wise process signals work precisely because outcome-only correctness leaves the internal procedure unconstrained: DPO contrasts correct against incorrect function calls (see "Can small models match large models on function calling?"), and supervised RL rewards similarity to expert *actions at each step* rather than only the final answer (see "Can step-wise expert rewards help small models learn hard reasoning?"). And techniques like proxy-tuning, which improve behavior while leaving base weights untouched, show that you can shift outputs without changing the underlying knowledge representation (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). Behavior and internals are separable knobs.
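A small Python sketch can show why an outcome-only signal leaves the procedure unconstrained. Two hypothetical trajectories reach the same final answer, one by following the expert's steps and one by a template shortcut. The string-similarity measure (`difflib.SequenceMatcher`), the position-by-position comparison, and the example steps are all assumptions chosen for illustration, not the scoring used in the cited notes.

```python
from difflib import SequenceMatcher


def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """Sparse reward: 1.0 only if the final answer matches, whatever the steps were."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0


def stepwise_reward(steps: list[str], expert_steps: list[str]) -> float:
    """Dense reward: mean text similarity to the expert step at each position."""
    if not expert_steps:
        return 0.0
    total = 0.0
    for i, expert in enumerate(expert_steps):
        model_step = steps[i] if i < len(steps) else ""  # missing step scores 0
        total += SequenceMatcher(None, model_step, expert).ratio()
    return total / len(expert_steps)


# Hypothetical problem: solve 2x + 3 = 11.
gold_answer = "x = 4"
expert_steps = ["subtract 3 from both sides", "divide both sides by 2"]

# Same final answer, different internal procedure.
faithful = {"steps": ["subtract 3 from both sides", "divide both sides by 2"],
            "answer": "x = 4"}
shortcut = {"steps": ["this looks like the usual template", "the answer is usually 4"],
            "answer": "x = 4"}

if __name__ == "__main__":
    for name, traj in (("faithful", faithful), ("shortcut", shortcut)):
        print(name,
              outcome_reward(traj["answer"], gold_answer),
              round(stepwise_reward(traj["steps"], expert_steps), 3))
```

The outcome reward cannot tell the two trajectories apart, while the step-wise score separates them. That is the sense in which process signals constrain something that final-answer correctness does not.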

The thing you didn't know you wanted to know: the gap between behavior and learned objective isn't a measurement nuisance you can engineer away — it's structural. Outcome rewards underdetermine internals, output supervision can't see calibration or grounding, and a model can't verify its own objectives. Correct behavior is necessary evidence, never a guarantee.

Sources (8 notes)

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, reaching 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can AI systems achieve real alignment without world contact?

Peircean semiotics reveals that symbolic goal encoding without world contact and social mediation cannot guarantee correspondence to actual values. LLMs operating in pure symbol manipulation risk divergence between stated goals and real-world outcomes.

What actually constrains large language models from self-improvement?

LLMs cannot reliably improve themselves without external verification; metacognition must be externalized rather than learned. Alignment philosophy is shifting from preferentism to normative standards, but coherent values at scale include problematic self-valuation requiring utility engineering beyond output control.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
