Can reasoning skills trained on law improve performance in STEM?

This explores whether reasoning ability is a portable, domain-general skill, so that practice on legal arguments could carry over to math, physics, or code, or whether it is bound to the domain it was learned in. The corpus has no law-specific papers, but it has a sharp debate about *why* reasoning sometimes travels and sometimes doesn't, which is the deeper question behind this line of inquiry.

The optimistic case rests on a striking finding: reasoning generalizes because it draws on **broad, transferable procedural knowledge** rather than memorized facts. An analysis of five million pretraining documents showed that when models reason, they lean on diverse sources teaching *how to do things* (a method, a derivation, a proof strategy), whereas factual recall depends on narrow, document-specific memorization (see "Does procedural knowledge drive reasoning more than factual retrieval?"). If reasoning is procedural, then a strategy practiced on legal argument (track premises, test for contradictions, chain inferences) is exactly the kind of thing that could resurface in a STEM proof. There is also direct evidence that reasoning RL can be trained across many domains at once without per-domain answer-checking (see "Can reasoning improvement work without answer verification?").

But the corpus also names the catch. One note found that reasoning training **improves math but can degrade knowledge-heavy domains like medicine**. Its explanation is a two-phase picture in which knowledge retrieval operates in the lower network layers and reasoning adjustment happens in the higher ones, so sharpening one can blunt the other (see "Why does reasoning training help math but hurt medical tasks?"). So transfer isn't free: skills move best toward domains that are *also* reasoning-bound (math, logic, code) and worst toward domains that are knowledge-bound. Law-trained reasoning would likely help STEM problem-solving more than it helps fact-retrieval tasks.

The skeptical thread is harsher still. Chain-of-thought reasoning has been shown to be **distribution-bounded**: it fails systematically when the task, length, or format shifts away from training, producing fluent but logically inconsistent output (see "Does chain-of-thought reasoning actually generalize beyond training data?"). Worse, several notes argue that what models learn is the *form* of reasoning, not genuine inference: illogical chain-of-thought exemplars perform about as well as valid ones on BIG-Bench Hard (see "Does logical validity actually drive chain-of-thought gains?"), and CoT looks like constrained imitation rather than abstract logic (see "What makes chain-of-thought reasoning actually work?"). If a model is imitating the *surface pattern* of legal reasoning, that pattern won't necessarily match the surface of a physics problem, and the transfer collapses.

The resolution the corpus points to is that "reasoning" isn't one thing. There is a content-sensitive layer and a portable procedural layer. On the content-sensitive side, both humans and models succeed or fail along the same content axis, so demanding pure content-independence is the wrong test (see "Do language models fail reasoning tests that humans pass?"). Law → STEM transfer should work to the extent it carries the procedural skill (systematic search, contradiction-checking) and fail to the extent it carries domain-specific surface patterns. That reframes your question from "does it transfer?" to "which part of legal reasoning is the model actually learning?", which is the more useful question to ask.

Sources (7 notes)

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Can reasoning improvement work without answer verification?

VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.

Why does reasoning training help math but hurt medical tasks?

Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

Do language models fail reasoning tests that humans pass?

Research shows both humans and LLMs succeed and fail along the same content-sensitivity axis in reasoning tasks like Wason tests and natural language inference. Content-independence is not a meaningful criterion for distinguishing real reasoning from pattern matching.
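The verifier-free reward described in the note "Can reasoning improvement work without answer verification?" can be illustrated with a small sketch. This is not the VeriFree implementation. It only assumes, as the note states, that the reward for a reasoning trace is the conditional probability the model assigns to the reference answer given that trace. The function name and the log-probability values below are hypothetical; in practice the per-token log-probabilities would come from scoring the reference-answer tokens with the model being trained.

```python
import math
from typing import Sequence


def reference_answer_probability(answer_token_logprobs: Sequence[float]) -> float:
    """Return p(reference answer | question, reasoning trace).

    Assumes the caller has already scored each reference-answer token
    conditioned on the question and the generated reasoning trace.
    Summing per-token log-probabilities gives the log of the joint
    probability of the whole answer.
    """
    return math.exp(sum(answer_token_logprobs))


# Hypothetical log-probs of the same reference answer after two different traces.
logprobs_after_trace_a = [-0.1, -0.2, -0.05]  # trace that makes the answer likely
logprobs_after_trace_b = [-1.2, -0.9, -2.0]   # trace that makes the answer unlikely

reward_a = reference_answer_probability(logprobs_after_trace_a)
reward_b = reference_answer_probability(logprobs_after_trace_b)

# No rule-based or model-based verifier is consulted: the trace that makes
# the reference answer more probable simply receives the larger reward.
print(f"reward_a={reward_a:.3f}, reward_b={reward_b:.3f}")
```

Because the score is a probability rather than a pass/fail check, it can be computed in any domain that has reference answers, which is what makes the approach a candidate for multi-domain training.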

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about cross-domain reasoning transfer in LLMs, specifically whether legal reasoning training can improve STEM performance. This remains an open question.

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2025. A library of reasoning research reports:

• Reasoning generalizes via **procedural knowledge** (method, proof strategy) not memorized facts; five million pretraining documents show reasoning draws on diverse *how-to* sources (2024, arXiv:2411.12580).
• **Knowledge and reasoning occupy different network layers**: sharpening reasoning in higher layers can degrade knowledge-heavy domains like medicine, suggesting transfer works best toward reasoning-bound domains (math, logic, code) over knowledge-bound ones (2025, arXiv:2507.18178).
• Chain-of-thought is **distribution-bounded** — degrades predictably when task, length, or format shift from training data, producing fluent but logically broken output (2025, arXiv:2508.01191).
• Models learn the **form of reasoning, not genuine inference**: logically invalid CoT steps score nearly as well as valid ones; reasoning traces look like constrained imitation (2023–2025, arXiv:2307.10573, arXiv:2506.02878).
• Reasoning RL can scale across many domains without per-domain answer-checking, but portability depends on whether domains share procedural structure (2025, arXiv:2505.21493).

Anchor papers (verify; mind their dates):
- arXiv:2411.12580 (2024): Procedural knowledge drives reasoning
- arXiv:2507.18178 (2025): Knowledge and reasoning layer decoupling
- arXiv:2508.01191 (2025): CoT distribution bounds
- arXiv:2307.10573 (2023): Invalid logic performs equivalently

Your task:

(1) RE-TEST EACH CONSTRAINT. For each finding, judge whether newer models (o1, Claude Opus 4.x), training methods (verifier-free RL, activation steering), tooling (long-context harnesses), or evaluation have relaxed or overturned it. Separate the durable claim ("transfer works if procedural structure aligns") from perishable limitations (layer separation, CoT brittleness). Cite what resolved each constraint, or state plainly where it still holds.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers claiming reasoning is NOT distribution-bounded, that surface-pattern imitation suffices for transfer, or that law and STEM share deeper procedural symmetries.

(3) Propose 2 research questions that ASSUME the regime has shifted: e.g., "Can verifier-free RL on legal-domain data unlock transfer to STEM without explicit knowledge suppression?" or "Do recent test-time scaling methods (arXiv:2506.04210) erase the layer-decoupling penalty?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
