How do domain training techniques actually reshape model behavior? · Gravity7

SFT-RL Training Dynamics

2 notes

Why does SFT-then-RL training follow a predictable three-phase pattern?

When expert data diverges from a model's learned patterns, SFT-then-RL training exhibits disruption, readaptation, and overfitting phases. Understanding this progression could improve how we combine imitation and reinforcement learning.

Does RL training collapse format diversity in pretrained models?

Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.

Training Quality and Compatibility

5 notes

Does critiquing errors teach deeper understanding than imitating correct answers?

Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.

Does teacher-refined data always improve student model performance?

Explores whether higher-quality training data from teacher models uniformly benefits student models, or if compatibility with the student's current learning state matters for effective instruction.

Why does reasoning training help math but hurt medical tasks?

Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.

Why do LLMs struggle to connect unrelated entities speculatively?

LLMs reliably organize and summarize evidence but fail when asked to speculate about connections between dissimilar entities. Understanding this failure could reveal fundamental limits in how models handle complex analytical reasoning.

Does fine-tuning weaken how reasoning steps influence answers?

When models are fine-tuned on domain-specific tasks, do their chain-of-thought reasoning steps actually causally drive the final answer, or do they become decorative? This matters because accurate outputs can mask unfaithful reasoning.

Alignment Data Efficiency

2 notes

Can 1000 carefully chosen examples align models effectively?

Does alignment require massive datasets, or can strategic curation of small, high-quality examples achieve comparable performance? LIMA tests whether quality beats quantity in post-training.

Can aligned LLMs generate their own training data?

Does feeding an aligned model only its prompt template cause it to self-synthesize high-quality instructions? This explores whether alignment training encodes a latent instruction-generation capability.

Self-Generated Domain Training Data

2 notes

Does self-generated training data improve model learning?

Can models learn more effectively from training data they generate themselves rather than data created by external sources? This explores whether a learner's own restructuring process produces better learning outcomes.

Can synthetic dialogues become realistic through layered diversity?

Explores whether combining persona variation, subtopic specificity, and contextual grounding can generate synthetic dialogues that match real conversational data quality and capture the full spectrum of dialogue diversity.

Knowledge Graph-Based Domain Specialization

1 note

Can knowledge graphs teach models deep domain expertise?

Explores whether organizing knowledge as structured graph paths, composed from simple to complex, can enable language models to develop genuine domain superintelligence rather than surface-level pattern matching.

Parameter-Efficient and Training Techniques

8 notes

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.

Can models learn multi-token concepts during fine-tuning?

Does training models to predict multiple tokens at once, rather than one token sequentially, help them form coherent semantic units? This matters because current next-token prediction fragments concepts like "ribonucleic acid" into arbitrary subword pieces.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Explores whether identifying and protecting task-specific parameter regions can prevent the performance degradation that occurs when fine-tuning models on multiple tasks simultaneously. This matters because it could enable safe multi-task adaptation without sacrificing individual task performance.

Can we train better models on less data?

Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.

Can semantic knowledge shift model behavior like reinforcement learning does?

Can textual descriptions of successful reasoning patterns, prepended as context, achieve the same distribution shifts that RL achieves through parameter updates? This matters because it could eliminate the need for expensive fine-tuning on limited data.

Can context playbooks prevent knowledge loss during iteration?

When AI systems iteratively refine their instructions and memories, do structured incremental updates better preserve domain knowledge than traditional rewriting? This matters because context degradation undermines long-term agent performance.

Can models dynamically activate expert skills at inference time?

Can language models efficiently discover and compose task-specific capabilities on the fly without modifying base weights? This explores whether test-time adaptation through expert vector composition outperforms fixed fine-tuning approaches.

Does reasoning rely on procedural knowledge or factual memorization?

Explores whether LLMs learn reasoning through general procedural patterns across documents or through memorizing specific facts. Understanding this distinction matters for training data strategy.

Verifier-Free and Multi-Task RL

2 notes

Can reasoning RL work without verifying generated answers?

Most reasoning RL methods require answer verification, limiting them to math and code. Can models be trained to reason better in domains like medicine and law where verification is impractical?

Does training order reshape how models handle different task types?

Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.

RLVR Extensions to General Domains

6 notes

Can breaking down instructions into checklists enable better reinforcement learning?

Explores whether decomposing instruction quality into verifiable yes/no criteria allows RL systems to improve on tasks that lack clear correctness signals, like creative writing or social reasoning.

What makes rubric-based reward learning resistant to exploitation?

Rubric-based RL systems face reward hacking vulnerabilities. This explores what design patterns, architectural mechanisms, and iterative defenses enable rubrics to remain robust against model exploitation across diverse tasks.

Can model confidence alone replace external answer verification?

Can LLMs use their own certainty signals instead of external verifiers to improve reasoning? This matters for scaling beyond domains where correct answers can be automatically checked.

Can reasoning emerge from expert demonstrations alone?

Can AI systems learn to reason about non-verifiable tasks by studying expert examples rather than explicit reward signals? This matters because many high-value domains like medicine and law have abundant demonstrations but no automated verifiers.

Can adaptive guidance from solution traces reduce reward sparsity in RL?

When reinforcement learning struggles with hard problems due to sparse rewards and zero-advantage rollouts, does providing partial solution traces as adaptive guidance help the model learn more efficiently? This matters because standard RL wastes compute on unsolvable problems.

Why does RLVR training narrow a model's problem solving ability?

RLVR's on-policy constraint may force models to exploit known reasoning paths rather than explore new ones, potentially shrinking their effective problem-solving scope. Understanding this mechanism could reveal how to design better exploration incentives in language model reasoning.

Pass 3 Additions (2026-05-03)

4 notes

Can reconstructing expert thinking improve reasoning transfer?

Expert texts show only the final result of complex thinking. Can we reverse-engineer those hidden thought processes and use them to train models that reason better across different domains?

Why do language models need so much more text than humans?

Language models train on the surface of written text, but humans learn by inferring the underlying thoughts behind what they read. Does this explain why models need vastly more data to reach human-level understanding?

Can agents learn beyond what their training data shows?

Explores whether supervised fine-tuning on expert demonstrations creates a hard ceiling on agent competence, or whether agents can generalize to scenarios their curators never captured.

How do quality, diversity, and complexity affect synthetic data differently?

When training models on synthetic data, do quality, diversity, and complexity each play distinct roles in how well models generalize? Understanding their separate effects could explain why current optimization strategies fail.

Related Areas

6 notes

How do you add domain expertise without losing general reasoning?

Exploring the tension between injecting specialized knowledge and preserving a model's broad problem-solving ability. Five distinct approaches exist, each with different trade-offs in cost, flexibility, and reliability.

What actually changes inside a model during RL training?

RL training modifies only sparse regions of model parameters through suppression of incorrect paths rather than broad capability building. Understanding these mechanics reveals how fine-tuning shapes reasoning and what hidden costs accompany optimization.

What does reward learning actually do to model reasoning?

Explores whether RLVR expands reasoning capabilities or merely activates latent skills. Investigates the mechanism by which rewards reshape model outputs and whether this constitutes genuine learning or efficient sampling.

How do you add domain expertise without losing general reasoning?

Exploring the tension between injecting specialized knowledge and preserving a model's broad problem-solving ability. Five distinct approaches exist, each with different trade-offs in cost, flexibility, and reliability.

How do you build domain expertise into general AI models?

When LLMs are trained on everything, they excel at nothing. This explores the core trade-off: how to inject deep domain knowledge without creating brittle specialists that fail outside their niche.

How should researchers navigate LLM reasoning research?

This note explores how to systematically explore interconnected insights about test-time scaling, reasoning architectures, and language model cognition. It matters because LLM research spans multiple domains—from inference compute to philosophy—and understanding the map helps identify novel connections.