Can models learn behavioral principles without preference labels?
Can alignment happen by amplifying the latent connection between stated principles and model behavior, rather than relying on expensive human preference annotations? This note explores whether information-theoretic objectives could replace the preference-labeling bottleneck.
Instilling behavioral principles in a language model usually requires human preference labels or demonstrations, which are expensive and technically demanding to collect. SAMI (Self-Supervised Alignment with Mutual Information) builds on the insight that alignment methods mostly expose and amplify behavior already implicit in the base model: a pretrained model already carries a weak statistical connection between behavioral principles stated in natural language and the behavior that realizes them. SAMI is an iterative algorithm that finetunes the LM to increase the conditional mutual information between constitutions and self-generated responses, given queries. It requires no preference labels and no demonstrations. A SAMI-trained mistral-7b beats the base model with win rates of 66–77% and even surpasses the instruction-tuned baseline on single-turn dialogue. Strikingly, a weak instruction-tuned model can write the constitution used to align a stronger base model.
The keeper is the mechanism: alignment as amplifying a latent principle-behavior correlation via an information-theoretic objective, sidestepping the preference-label bottleneck. The weak-to-strong direction (weak constitution-writer, strong aligned model) is also a notable scalable-oversight signal.
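One way to make the objective concrete is a contrastive estimate of the mutual information. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes that for a single query we already have a square table `logp` where `logp[i][j]` is the model's log-probability of response `j` (generated under constitution `j`) when conditioned on constitution `i`. The function names `contrastive_mi_loss` and `mi_lower_bound` are hypothetical, and the symmetric InfoNCE-style form is an assumed choice of estimator. Lowering the loss means each response becomes more identifiable from its own constitution, and vice versa.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    log_z = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_z for v in values]


def contrastive_mi_loss(logp):
    """Symmetric InfoNCE-style loss for one query (assumed estimator).

    logp[i][j] is the log-probability of response j under constitution i.
    Response i is assumed to have been generated under constitution i,
    so the diagonal entries are the "matching" pairs.
    """
    n = len(logp)
    # Row direction: given constitution i, identify its own response.
    row_loss = -sum(log_softmax(logp[i])[i] for i in range(n)) / n
    # Column direction: given response j, identify its own constitution.
    columns = [[logp[i][j] for i in range(n)] for j in range(n)]
    col_loss = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (row_loss + col_loss)


def mi_lower_bound(logp):
    """InfoNCE-style lower bound on mutual information: log(n) - loss."""
    return math.log(len(logp)) - contrastive_mi_loss(logp)


# Responses that ignore the constitution: every entry is equally likely.
uninformative = [[-1.0, -1.0], [-1.0, -1.0]]
# Responses that track their own constitution: the diagonal dominates.
informative = [[-1.0, -5.0], [-5.0, -1.0]]

print(mi_lower_bound(uninformative))
print(mi_lower_bound(informative))
```

In this toy setting the uninformative table gives a bound of zero, while the table with a dominant diagonal gives a positive bound. In an actual training loop the log-probabilities would come from the model being finetuned on its own sampled responses, and the loss would be minimized by gradient descent. That part is omitted here.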
This sits in the vault's alignment-without-labels thread. It is a constitutional-style method that, like "Can models learn to ask better clarifying questions through self-improvement?" and the verifier-free RL family, removes a human-supervision dependency. It also presupposes the latent-capability premise that "Do base models already contain hidden reasoning ability?" makes for reasoning, applied here to behavioral principles.
Inquiring lines that use this note as a source 7
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Do base models already contain latent behavioral principles waiting to be amplified?
- What makes principle-response mutual information sufficient for behavioral alignment?
- Can information-gain principles improve how we choose what to label?
- How do static benchmarks fail to capture human preference alignment?
- How does preference learning differ from supervised finetuning for reasoning?
- Can preference trees structure alignment data for domains beyond math and code?
- What does egalitarian social choice theory contribute to AI alignment?
Related concepts in this collection 3
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  SAMI's premise: alignment amplifies a latent principle-behavior connection already in the base model
- Can models learn to ask better clarifying questions through self-improvement?
  This explores whether question-asking is a trainable skill that improves when models are rewarded for questions that lead to better answers. It matters because asking good clarifying questions could help AI systems handle underspecified user requests.
  sibling method removing a human-supervision dependency via self-improvement
- Are RLHF annotations actually measuring genuine human preferences?
  RLHF trains on annotation responses as stable preferences, but behavioral science shows humans often construct answers without holding real opinions. Does this measurement gap undermine the entire approach?
  SAMI sidesteps the preference-elicitation step whose validity that note questions
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels
- Post-training makes large language models less human-like
- Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment
- Direct Language Model Alignment from Online AI Feedback
- The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs
- MaxMin-RLHF: Alignment with Diverse Human Preferences
- Beyond Preferences in AI Alignment
- Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries
Original note title
a model can be aligned to behavioral principles without preference labels by maximizing mutual information between a constitution and its responses