Can reasoning skills trained on law improve performance in STEM?
This explores whether reasoning ability is a portable, domain-general skill — so that practice on legal arguments could carry over to math, physics, or code — or whether it's bound to the content it was trained on.
The corpus has no law-specific papers, but it does contain a sharp debate about *why* reasoning sometimes travels and sometimes doesn't, which is the deeper question behind law → STEM transfer.
The optimistic case rests on a striking finding: reasoning generalizes because it draws on **broad, transferable procedural knowledge** rather than memorized facts. An analysis of five million pretraining documents showed that when models reason, they lean on diverse sources teaching *how to do things* (a method, a derivation, a proof strategy), whereas factual recall depends on narrow, document-specific memorization (see "Does procedural knowledge drive reasoning more than factual retrieval?"). If reasoning is procedural, then a strategy practiced on legal argument (track premises, test for contradictions, chain inferences) is exactly the kind of thing that could resurface in a STEM proof. There is also evidence that reasoning RL can be pushed across many domains at once without per-domain answer-checking (see "Can reasoning improvement work without answer verification?").
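The verifier-free idea cited above is summarized in the sources as using the conditional probability of the reference answer, given a generated reasoning trace, as the reward. The following is a minimal sketch of that scoring step only, not the method's actual implementation. It assumes the per-token log-probabilities of the reference answer have already been obtained from a model; the function names and the numbers are hypothetical and chosen purely for illustration.

```python
import math


def reference_answer_probability(token_logprobs):
    """Probability of the full reference answer given question + trace.

    Assumes `token_logprobs` holds the model's log-probability for each
    token of the reference answer (hypothetical input, not a real API).
    """
    return math.exp(sum(token_logprobs))


def score_traces(traces):
    """Map each sampled reasoning trace to its verifier-free reward."""
    return {
        trace_id: reference_answer_probability(logprobs)
        for trace_id, logprobs in traces.items()
    }


# Made-up log-probs of the reference answer's tokens under two traces.
example = {
    "trace_a": [-0.1, -0.2, -0.05],  # trace makes the answer likely
    "trace_b": [-1.5, -2.0, -0.7],   # trace makes the answer unlikely
}

rewards = score_traces(example)
best = max(rewards, key=rewards.get)
print(best, rewards)
```

No rule-based or model-based checker appears anywhere in this sketch: a trace is rewarded only to the extent that it makes the reference answer more probable, which is why the same scoring could in principle be applied to domains where answers are hard to verify automatically.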
But the corpus also names the catch. One note found that reasoning training **improves math but can degrade knowledge-heavy domains like medicine**, because knowledge retrieval operates in the lower network layers and reasoning adjustment happens in the higher ones, so sharpening one can blunt the other (see "Why does reasoning training help math but hurt medical tasks?"). So transfer isn't free: skills move best toward domains that are *also* reasoning-bound (math, logic, code) and worst toward domains that are knowledge-bound. Law-trained reasoning would likely help STEM problem-solving more than it helps fact-retrieval tasks.
The skeptical thread is harsher still. Chain-of-thought (CoT) reasoning has been shown to be **distribution-bounded**: it degrades predictably when the task, length, or format shifts away from training, producing fluent but logically broken output (see "Does chain-of-thought reasoning actually generalize beyond training data?"). Worse, several notes argue that what models learn is the *form* of reasoning, not genuine inference: logically invalid reasoning steps score nearly as well as valid ones (see "Does logical validity actually drive chain-of-thought gains?"), and CoT looks like constrained imitation rather than abstract logic (see "What makes chain-of-thought reasoning actually work?"). If a model is imitating the *surface pattern* of legal reasoning, that pattern won't necessarily match the surface of a physics problem, and the transfer collapses.
The resolution the corpus points to is that "reasoning" isn't one thing. There is a content-sensitive layer and a portable procedural layer. On the content-sensitive side, both humans and models succeed or fail along the same content axis, so demanding pure content-independence is the wrong test (see "Do language models fail reasoning tests that humans pass?"). Law → STEM transfer should work to the extent it carries the procedural skill (systematic search, contradiction-checking) and fail to the extent it carries domain-specific surface patterns. That reframes the question from "does it transfer?" to "which part of legal reasoning is the model actually learning?", which is the more useful question to ask.
Sources (7 notes)
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.
Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
Research shows both humans and LLMs succeed and fail along the same content-sensitivity axis in reasoning tasks like Wason tests and natural language inference. Content-independence is not a meaningful criterion for distinguishing real reasoning from pattern matching.