INQUIRING LINE

How do task difficulty and skill type interact in model performance?

This explores how the *difficulty* of a task and the *kind of skill* it demands aren't independent dials — the corpus suggests difficulty changes which skills a model builds, and different skills respond to difficulty (and scale) in opposite directions.


This reads the question as: do difficulty and skill type interact, or do they act independently on performance? The corpus is emphatic that they interact, and in ways that make a single accuracy number deeply misleading. The sharpest result is that difficulty doesn't just make a task harder; it changes *what the model learns*. Easy problems reinforce answer shortcuts while actively suppressing deliberation; hard problems activate genuine reasoning only on the rare occasions the model succeeds; medium difficulty is the sweet spot that strengthens both at once (see: What reasoning features does each difficulty level reinforce?). So two training runs can post identical accuracy gains while moving the model's internals in opposite directions: one toward reasoning, one toward shortcutting.
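One way to make the easy/medium/hard split operational is to bucket problems by the model's empirical pass rate before training. The sketch below assumes that proxy. The corpus notes do not say how difficulty was measured, and the thresholds, function names, and attempt logs here are hypothetical.

```python
def pass_rate(outcomes):
    """Fraction of sampled attempts that were correct (outcomes are 0/1)."""
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(outcomes) / len(outcomes)


def difficulty_bucket(rate, easy_at=0.8, hard_at=0.2):
    """Map a pass rate to a coarse bucket. Thresholds are illustrative guesses."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("pass rate must be in [0, 1]")
    if rate >= easy_at:
        return "easy"
    if rate <= hard_at:
        return "hard"
    return "medium"


# Hypothetical attempt logs: 1 = correct, 0 = incorrect.
attempts = {
    "problem_a": [1, 1, 1, 1, 1, 1, 1, 0],  # 7/8 correct
    "problem_b": [1, 0, 1, 0, 0, 1, 0, 1],  # 4/8 correct
    "problem_c": [0, 0, 0, 0, 0, 0, 0, 1],  # 1/8 correct
}
buckets = {name: difficulty_bucket(pass_rate(o)) for name, o in attempts.items()}
print(buckets)
```

Reporting accuracy per bucket, rather than as one pooled number, is the simplest guard against the misleading aggregate described above.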

Push difficulty too far and the interaction turns toxic. Training on near-impossible problems doesn't just fail to teach; it teaches *degenerate* skills, because group-relative reward normalization treats a lucky accidental success as a high-advantage trajectory and reinforces answer repetition and computation-skipping. Worse, those shortcuts then contaminate skills the model already had (see: Do overly hard RLVR samples actually harm model capabilities?). Difficulty, in other words, isn't a scalar that uniformly raises or lowers performance; it selects which behaviors get amplified.
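The arithmetic behind that failure mode can be sketched in a few lines. This assumes the common mean/standard-deviation form of group-relative normalization, (r - mean) / (std + eps); the notes name the mechanism but not the exact formula, and the group size and rewards below are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical groups of 16 rollouts with binary rewards (1 = verified correct).
medium = [1] * 8 + [0] * 8          # half the rollouts succeed
near_impossible = [1] + [0] * 15    # a single lucky success

adv_medium = group_relative_advantages(medium)
adv_hard = group_relative_advantages(near_impossible)

print(round(adv_medium[0], 2))  # advantage of a success at medium difficulty: 1.0
print(round(adv_hard[0], 2))    # advantage of the lone lucky success: 3.87
```

Under these assumptions the lone success on the near-impossible problem gets nearly four times the advantage of a success on the medium problem, regardless of whether the trajectory that produced it reasoned soundly or skipped the computation.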

The skill-type axis is just as uneven. When you decompose performance into distinct skills, they scale with model size at very different rates: reasoning and knowledge skills keep improving, while metacognition saturates around 7B parameters and logical efficiency plateaus around 30B. Surface style is the easy part, which is why distilled open-source models can convincingly imitate a teacher's *style* while failing at its *reasoning*: distillation copies form, not substance (see: Do all AI skills improve equally as models scale?). The same form-vs-substance split shows up in instruction tuning, where models trained on semantically empty or even wrong instructions match models trained on correct ones. What transfers is knowledge of the output *format*, a shallow skill, not task understanding (see: Does instruction tuning teach task understanding or output format?).

The two axes meet most cleanly in how models physically respond to hard tasks. As difficulty rises, hidden states sparsify in a localized, systematic way that tracks unfamiliarity and reasoning load. This acts as an adaptive filter that stabilizes performance, not as a breakdown (see: Do language models sparsify their activations under difficult tasks?). But a tempting proxy for difficulty turns out to be a trap: longer chain-of-thought traces correlate with harder problems only *in-distribution*. Out-of-distribution, trace length decouples from difficulty entirely and instead reflects how close a problem sits to training schemas (see: Does longer reasoning actually mean harder problems?). So 'the model thinks longer, it must be working harder' is a skill-type confusion dressed up as a difficulty signal.
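A quick way to check whether trace length is a usable difficulty proxy is to compute the correlation separately for in-distribution and out-of-distribution problems instead of pooling them. The sketch below uses a hand-written Pearson correlation; the difficulty scores and trace lengths are made-up numbers chosen only to show the pattern, not measurements from the cited experiments.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


# Hypothetical toy data: difficulty score vs. reasoning-trace length in tokens.
difficulty = [1, 2, 3, 4, 5]
trace_len_in_dist = [10, 22, 29, 41, 50]   # length rises with difficulty
trace_len_ood = [40, 12, 35, 15, 38]       # length unrelated to difficulty

r_in = pearson(difficulty, trace_len_in_dist)
r_ood = pearson(difficulty, trace_len_ood)
print(f"in-distribution r = {r_in:.2f}, out-of-distribution r = {r_ood:.2f}")
```

If the two coefficients diverge the way they do in this toy case, a pooled correlation would blur a strong in-distribution signal with an absent out-of-distribution one.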

The lateral lesson the corpus keeps circling: skill type often determines the *direction* of an effect, while difficulty or training pressure sets only its *magnitude*. Multi-task RL shows structured domains drive output entropy down while creative domains drive it up, so the *order* you train them in matters, and doing structured tasks first protects open-ended capability (see: Does training order reshape how models handle different task types?). Preference tuning reduces diversity in code but increases it in creative writing, because each domain rewards a different thing (see: Does preference tuning always reduce diversity the same way?). If you want one takeaway you didn't come looking for: there is no domain-neutral 'harder = better signal' rule. The same training pressure sharpens one skill and corrupts another, and difficulty is the knob that decides which.
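For readers who want the entropy claim in concrete terms, the sketch below computes Shannon entropy (in nats) of a next-token distribution. The two distributions are hypothetical stand-ins for a structured domain that has converged on one answer and a creative domain that keeps several continuations open. How the cited work actually measures output entropy is not stated in the notes.

```python
from math import log


def shannon_entropy(probs, tol=1e-6):
    """Shannon entropy in nats of a discrete probability distribution."""
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > tol:
        raise ValueError("probs must be non-negative and sum to 1")
    return -sum(p * log(p) for p in probs if p > 0)


# Hypothetical next-token distributions over a four-token vocabulary.
structured = [0.97, 0.01, 0.01, 0.01]  # mass concentrated on one token
creative = [0.25, 0.25, 0.25, 0.25]    # mass spread evenly

h_structured = shannon_entropy(structured)
h_creative = shannon_entropy(creative)
print(f"structured: {h_structured:.3f} nats, creative: {h_creative:.3f} nats")
```

Tracking this quantity per domain during training is one way to notice entropy falling in structured tasks before it spills over into open-ended ones.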


Sources 8 notes

What reasoning features does each difficulty level reinforce?

Easy problems reinforce answer shortcuts while suppressing deliberation; hard problems activate reasoning features only on rare success; medium difficulty strengthens both simultaneously. Identical accuracy gains can reflect opposite internal changes.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Do all AI skills improve equally as models scale?

FLASK's 12-skill decomposition reveals metacognition saturates at 7B parameters while logical efficiency plateaus at 30B, but reasoning and knowledge skills improve continuously. Open-source models successfully imitate surface-level style but fail at reasoning—confirming that distillation copies form not substance.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% vs. a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Next inquiring lines