Can expert-derived knowledge bases scale to other high-stakes domains?
This explores whether the recipe behind expert-built knowledge bases — proven in places like medicine — can be ported to other high-stakes fields, or whether each domain needs its own bespoke effort.
The corpus is cautiously optimistic, but the optimism rests on a specific insight: what scales is *structure*, not data. When a 32B model was fine-tuned on reasoning tasks derived from medical knowledge-graph paths, it reached state-of-the-art performance across fifteen medical domains, and the lesson the authors draw is that compositional primitives matter more than raw scale [Can knowledge graphs teach models deep domain expertise?]. That's encouraging for transfer, because primitives and composition rules are exactly the kind of thing you can re-derive in a new field.
The same theme shows up from several angles. StructTuning reaches half of full-corpus performance using 0.3% of the data, simply by organizing chunks into auto-generated domain taxonomies: the model learns where a fact sits in a conceptual map, the way a student learns from a textbook rather than a flood of pages [Can organizing knowledge structures beat raw training data volume?]. An industrial case study went further and skipped retraining entirely: once expert rules and design principles were codified directly into an agent's scaffolding, non-experts produced expert-rated work, a 206% quality improvement that came from *externalizing tacit expertise* into the harness, not from a bigger model [Can codified expertise let non-experts match specialist output?]. And when you do want the knowledge inside the weights, reinforcement learning from augmented generation (RLAG) internalizes it more coherently than supervised fine-tuning by rewarding reasoning quality over token-matching [Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?]. Each of these is a domain-agnostic *method*, which is a reason to think the playbook travels.
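To make the "externalize expertise into the harness" idea concrete, here is a minimal sketch of one way codified rules could sit in an agent's scaffolding. It is not the case study's implementation: `ExpertRule`, `review`, and the two example rules are hypothetical names and content, invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ExpertRule:
    """One piece of codified expertise (hypothetical structure)."""
    name: str
    passes: Callable[[str], bool]  # True when the draft satisfies the rule
    advice: str                    # what an expert would tell a novice


def review(draft: str, rules: List[ExpertRule]) -> List[str]:
    """Return expert advice for every rule the draft violates."""
    return [f"{rule.name}: {rule.advice}" for rule in rules if not rule.passes(draft)]


# Illustrative rules only; real ones would be elicited from domain experts.
RULES = [
    ExpertRule(
        name="cite-source",
        passes=lambda d: "source:" in d.lower(),
        advice="Name the source for each factual claim.",
    ),
    ExpertRule(
        name="state-units",
        passes=lambda d: any(unit in d for unit in (" mg", " mL", " mm")),
        advice="Give explicit units for every quantity.",
    ),
]

if __name__ == "__main__":
    # A draft that breaks both rules gets both pieces of advice back.
    for note in review("Recommended dose is 5.", RULES):
        print(note)
```

The design point is that the rules live outside the model. Porting the harness to a new field means replacing `RULES`, not retraining, although the rules themselves still have to come from that field's experts.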
But here's the catch you didn't ask for, and it's the most important finding in the corpus: there is no free transfer. A survey of domain-adaptation techniques finds that every method, from parameter-efficient tuning to knowledge-graph curricula, has a domain-conditional sweet spot, and visible performance gains routinely hide degradation in reasoning faithfulness, capability transfer, and format flexibility [How do domain training techniques actually reshape model behavior?]. So "scale to other domains" doesn't mean copy-paste; it means re-finding the sweet spot and paying a quiet tax each time. In high-stakes settings such as medicine, law, and finance, that hidden cost to *reasoning faithfulness* is precisely the thing you can least afford.
The deeper limit is about what knowledge bases can and can't do once they're built. Prompt optimization cannot inject knowledge a model never learned; it can only reorganize what's already there, a hard ceiling no clever prompting escapes [Can prompt optimization teach models knowledge they lack?]. And the reasoning that sits on top of injected knowledge is fragile in a way that matters for high-stakes generalization: chain-of-thought degrades predictably once you move outside the training distribution, producing fluent-but-invalid logic [Does chain-of-thought reasoning actually generalize beyond training data?], and models tend to fail not at hard problems but at *unfamiliar* ones, because they fit instance-level patterns rather than transferable algorithms [Do language models fail at reasoning due to complexity or novelty?].
Put together, the corpus reframes your question. Expert-derived knowledge bases *do* scale across domains, but only the scaffolding scales: taxonomies, primitives, codified rules, and structure-aware retrieval such as routing queries to the right knowledge form [Can routing queries to task-matched structures improve RAG reasoning?]. The expert *content* and the per-domain calibration don't, and the reasoning layer stays brittle exactly at the novel, edge-case situations where high-stakes domains live. So the honest answer is: the recipe transfers, the dish must be cooked fresh each time, and you should budget for the hidden costs before you trust it with stakes.
Sources (9 notes)
Fine-tuning a 32B model on 24,000 reasoning tasks derived from medical knowledge graph paths produces state-of-the-art performance across 15 medical domains, demonstrating that structured knowledge composition matters more than scale.
StructTuning achieves 50% of full-corpus performance using only 0.3% of training data by organizing chunks into auto-generated domain taxonomies. The model learns knowledge position within conceptual structures rather than raw text patterns, matching how students learn from textbooks.
An industrial case study that embedded domain rules and design principles into an LLM agent's scaffolding achieved a 206% output-quality improvement and expert-level ratings for work produced by non-experts, bypassing the need for specialist oversight. The capability gain came from externalizing tacit expertise into structured harness components, not from model scale.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.
Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.
StructRAG demonstrates that selecting the knowledge structure type based on query demands, via a DPO-trained router that chooses among tables, graphs, algorithms, catalogues, and chunks, improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.