INQUIRING LINE

How much of the combinatorial task space must training data cover?

This line explores how much of the space of possible task-combinations training data has to span before a model can handle the rest. The corpus suggests the honest answer is 'more than you'd hope,' with some clever escape hatches.


This question is really asking whether models generalize to task-combinations they never saw, or whether they only replay regions of the space they were trained on, and the most sobering result in the collection says the latter. The DataAlchemy experiments (Does chain-of-thought reasoning actually generalize beyond training data?) show that chain-of-thought reasoning degrades *predictably* the moment you shift task, length, or format away from the training distribution. The model keeps producing fluent reasoning, but the logic underneath stops being valid. So in the pessimistic reading, coverage isn't a nice-to-have: capability is bounded to the slice of combinatorial space the data actually touched, and what looks like generalization is interpolation inside that slice.
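To make "the slice of combinatorial space the data actually touched" concrete, here is a minimal sketch that measures coverage over a grid of task, length, and format. The axis values and training examples are hypothetical placeholders rather than the DataAlchemy setup, and treating the space as a simple Cartesian product is an assumption.

```python
from itertools import product

def coverage_report(axes, training_examples):
    """Return (fraction of cells seen in training, sorted list of unseen cells).

    axes: dict mapping axis name -> list of possible values.
    training_examples: list of dicts with one value per axis.
    """
    all_cells = set(product(*axes.values()))
    seen = {tuple(ex[name] for name in axes) for ex in training_examples}
    seen &= all_cells  # ignore examples that fall outside the declared grid
    unseen = all_cells - seen
    return len(seen) / len(all_cells), sorted(unseen)

# Hypothetical axes: 2 tasks x 2 lengths x 2 formats = 8 cells.
axes = {
    "task": ["task_a", "task_b"],
    "length": [2, 4],
    "format": ["plain", "noisy"],
}
training_examples = [
    {"task": "task_a", "length": 2, "format": "plain"},
    {"task": "task_a", "length": 4, "format": "plain"},
    {"task": "task_b", "length": 2, "format": "plain"},
]

fraction, unseen = coverage_report(axes, training_examples)
print(f"covered {fraction:.1%} of cells; {len(unseen)} cells unseen")
```

Under the pessimistic reading, the unseen cells are where fluent but invalid reasoning would be expected, so a held-out evaluation would sample from that list rather than from the covered cells.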

There's a quieter, stranger finding that reframes what 'coverage' even means. Instruction tuning experiments (Does instruction tuning teach task understanding or output format?) show models trained on *semantically empty or deliberately wrong* instructions perform about as well as those trained on correct ones, and instruction-tuned models barely beat a random baseline (43% vs. 42.6%). What transfers isn't understanding of the tasks; it's familiarity with the shape of the output space. If that's true, then a lot of what we think we're covering (task semantics) is largely irrelevant, and the thing data actually needs to span is the distribution of *answer formats*, which is a much smaller space.
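One way to picture "familiarity with the shape of the output space" is a baseline that never reads the input and only knows which labels are valid. The sketch below is an illustration under that assumption, with made-up labels and data; it is not the evaluation code from the cited experiments.

```python
import random

def random_label_baseline(label_space, gold_labels, seed=0):
    """Accuracy of guessing uniformly from the valid labels, ignoring inputs."""
    rng = random.Random(seed)
    predictions = [rng.choice(label_space) for _ in gold_labels]
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

# Hypothetical binary classification task with 1,000 gold labels.
label_space = ["yes", "no"]
gold = ["yes"] * 600 + ["no"] * 400

accuracy = random_label_baseline(label_space, gold)
print(f"output-space-only baseline accuracy: {accuracy:.3f}")
```

If a tuned model's score sits close to this kind of baseline, the gain from training is better explained by learning the answer format than by learning the task.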

The most practical escape from brute-force coverage is decomposition. Granite's function-calling work (Can breaking function calling into subtasks improve model generalization?) found that breaking the job into seven granular subtasks (nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation) and training on each explicitly generalizes *better* than dumping one giant umbrella dataset on the model. This is the combinatorial trick: if the space factors into a handful of reusable primitives, you cover the primitives, not their exponential product. DPO training pushes the same idea from the other direction (Can small models match large models on function calling?): feeding explicit *wrong* examples teaches the boundaries of a subtask cheaply, so small models can reach high accuracy without seeing every variant.
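The arithmetic behind "cover the primitives, not their exponential product" can be sketched as follows. Modeling a composite task as an ordered sequence of primitives with repeats allowed is an assumption made for illustration, and the identifier names are hypothetical labels for the seven subtasks.

```python
from itertools import product

# Hypothetical identifiers for the seven subtasks named in the Granite note.
PRIMITIVES = [
    "nested_calls", "chaining", "parallel_functions", "name_detection",
    "parameter_detection", "next_best_function", "response_generation",
]

def count_compositions(n_primitives, max_length):
    """Count ordered sequences of 1..max_length primitives, repeats allowed."""
    return sum(n_primitives ** k for k in range(1, max_length + 1))

def enumerate_compositions(primitives, length):
    """List every ordered sequence of exactly `length` primitives."""
    return list(product(primitives, repeat=length))

for max_len in (1, 2, 3, 4):
    total = count_compositions(len(PRIMITIVES), max_len)
    print(f"compositions up to length {max_len}: {total}")
```

With seven primitives the count grows from 7 to 56, 399, and 2800 as the maximum length goes from 1 to 4, while the number of primitives to train on stays at seven. The saving only materializes if the model actually composes what it learned, which is the open question the rest of this line circles.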

But some failures have nothing to do with coverage at all. The 'embers of autoregression' work (Can we predict where language models will fail?) predicted *in advance* that tasks with low-probability target outputs, such as reciting the alphabet backwards or counting letters, would stay hard no matter how logically trivial they are, because the model is fundamentally a next-token probability machine. On that view, you could cover those tasks exhaustively in training and the autoregressive prior would still fight you. So the real answer isn't a single coverage percentage. It's that the space has structure: factorable regions where decomposition lets you cover a fraction and compose the rest, and probability-cursed regions where coverage barely helps.
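A toy version of the "low-probability target output" idea: fit a character bigram model on forward-alphabet text and compare the log-probability it assigns to the alphabet forwards and backwards. The bigram model with add-one smoothing is a deliberately crude stand-in for an LLM and is purely an assumption for illustration.

```python
import math
import string
from collections import Counter

def train_bigram(text):
    """Count character bigrams and the characters that start them."""
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])
    vocab_size = len(set(text))
    return pair_counts, first_counts, vocab_size

def sequence_logprob(seq, model, alpha=1.0):
    """Sum of smoothed log P(next char | current char) over the sequence."""
    pair_counts, first_counts, vocab_size = model
    total = 0.0
    for a, b in zip(seq, seq[1:]):
        p = (pair_counts[(a, b)] + alpha) / (first_counts[a] + alpha * vocab_size)
        total += math.log(p)
    return total

alphabet = string.ascii_lowercase
model = train_bigram(alphabet * 50)  # "training data" is forward text only

forward = sequence_logprob(alphabet, model)
backward = sequence_logprob(alphabet[::-1], model)
print(f"forward log-prob:  {forward:.1f}")
print(f"backward log-prob: {backward:.1f}")
```

Both strings are equally easy to specify logically, but the reversed one scores far lower because every one of its transitions is rare under the fitted distribution. That gap, not task difficulty, is what the autoregressive framing says predicts failure.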

If you want to chase the optimistic thread further, look at how training *order* over that space matters too: scheduling structured tasks before open-ended ones changes what survives (Does training order reshape how models handle different task types?), which hints that *how* you walk through the combinatorial space may matter as much as how much of it you cover.


Sources (6 notes)

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to models trained on full, correct instructions, and instruction-tuned models reach 43% where a random baseline reaches 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can breaking function calling into subtasks improve model generalization?

Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
