INQUIRING LINE

How does business logic specification replace annotated training datasets?

This explores whether you can train models by specifying rules and checks (the 'business logic') instead of hand-labeling thousands of examples — letting verification and preference signals do the teaching that annotation used to.


This reads the question as: can you swap hand-annotated datasets for a specification of the logic a system must satisfy, and let the model learn against that spec? The corpus has no single paper on 'business logic specification' by that name, but several notes triangulate the shift it describes — and the most honest answer is that the real substitution isn't spec-for-data, it's *verifier-for-annotator*.

The clearest mechanism is preference learning. In "Can small models match large models on function calling?", a small model is fine-tuned with DPO on correct/incorrect function-calling pairs generated by a larger teacher rather than on hand-labeled examples, and it is the explicit *negative* examples that fix the rigid output-format failures plain supervised fine-tuning can't. You're no longer annotating; you're specifying what counts as valid and letting the contrast carry the signal. Verifiable-reward training points the same way: "Do high-entropy tokens drive reasoning model improvements?" shows that only ~20% of tokens (the high-entropy forking points) carry the learning signal, so a checkable reward on the outcome can stand in for dense per-token labels. And "Does RL post-training create reasoning or just deploy it?" argues the capability is already latent: RL teaches *when* to deploy reasoning, not how, which means the 'data' you need to specify behavior is far thinner than the annotation framing assumes.
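A minimal sketch of "specifying what counts as valid and letting the contrast carry the signal" might look like the following. It is not the method from the note: the tool spec, the `is_valid_call` checker, and the candidate strings are all hypothetical, and the `prompt`/`chosen`/`rejected` keys are only a commonly used layout for preference data. The point is that the rule, not an annotator, decides which side of each pair an output lands on.

```python
import json

# Hypothetical spec for illustration: one tool and its required argument types.
SPEC = {"name": "get_weather", "args": {"city": str, "days": int}}

def is_valid_call(text: str) -> bool:
    """Return True if `text` is a JSON function call that satisfies SPEC."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return False  # not JSON at all: a format failure
    if not isinstance(call, dict) or call.get("name") != SPEC["name"]:
        return False
    args = call.get("arguments")
    if not isinstance(args, dict) or set(args) != set(SPEC["args"]):
        return False  # missing or unexpected arguments
    # Exact type match, so "3" (a string) is rejected where an int is required.
    return all(type(args[k]) is t for k, t in SPEC["args"].items())

def build_preference_pairs(prompt: str, candidates: list[str]) -> list[dict]:
    """Pair every spec-valid candidate with every spec-invalid one."""
    chosen = [c for c in candidates if is_valid_call(c)]
    rejected = [c for c in candidates if not is_valid_call(c)]
    return [
        {"prompt": prompt, "chosen": c, "rejected": r}
        for c in chosen
        for r in rejected
    ]

# Hypothetical teacher outputs: one valid call and two typical format failures.
candidates = [
    '{"name": "get_weather", "arguments": {"city": "Oslo", "days": 3}}',
    '{"name": "get_weather", "arguments": {"city": "Oslo", "days": "3"}}',
    'Sure! get_weather(city="Oslo", days=3)',
]
pairs = build_preference_pairs("Weather in Oslo for the next 3 days?", candidates)
```

Here the two rejected candidates are exactly the kind of negative case the note credits with fixing format failures: a wrong argument type and a call wrapped in chatty prose.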

But the corpus also marks where this substitution breaks. "Do large language models reason symbolically or semantically?" is the sharp caution: when you hand a model the correct rules in context but strip the familiar semantics, performance collapses, because models lean on token associations rather than formal logic. So a clean 'business logic spec' doesn't reliably transfer the way a programmer would hope; the model may ignore the rule and pattern-match instead. Stranger still, "Do reasoning traces need to be semantically correct?" finds that models trained on systematically *irrelevant* reasoning traces keep their solution accuracy, which suggests specification often works as computational scaffolding that triggers the right behavior, not as logic the model genuinely internalizes.

The hard ceiling comes from "What stops large language models from improving themselves?": the generation-verification gap means every reliable correction needs something external to validate and enforce it. That's exactly why specification can replace annotation, since a verifier is external validation in a compact form, but also why you can never fully escape needing a checker. You trade the cost of labeling examples for the cost of writing a spec the model can be measured against.
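To make "a verifier is external validation in a compact form" concrete, here is a small sketch under stated assumptions: the task is a math question with a single rational answer, `math_reward` and `first_verified` are hypothetical helper names, and the sample strings stand in for model outputs. The checker sits outside the generator and is the only thing that decides which output is kept.

```python
from fractions import Fraction
from typing import Callable, Iterable, Optional

def math_reward(answer: str, expected: str) -> float:
    """Return 1.0 if `answer` equals `expected` as an exact rational, else 0.0."""
    try:
        ok = Fraction(answer.strip()) == Fraction(expected.strip())
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable output earns no reward
    return 1.0 if ok else 0.0

def first_verified(
    samples: Iterable[str], verify: Callable[[str], float]
) -> Optional[str]:
    """Return the first sample the external verifier accepts, or None."""
    for sample in samples:
        if verify(sample) > 0:
            return sample
    return None

# Hypothetical model samples for a question whose answer is 1/2.
samples = ["0.4", "one half", "2/4", "0.5"]
kept = first_verified(samples, lambda s: math_reward(s, "1/2"))
```

Note what the sketch does not do: it cannot score "one half", because the rule only covers parseable numbers. Anything the checker cannot express is outside the reach of this kind of training signal.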

What a curious reader might not expect: replacing datasets with logic doesn't remove the labor, it relocates it. The annotator who once tagged examples becomes the engineer who designs the verifier and the negative cases. Whether that works at all depends on whether your task has a checkable rule (a function-calling format, a math answer) or only a semantic one the model will quietly route around.


Sources (6 notes)

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
