Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does training order reshape how models handle different task types?

Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.

Note · 2026-02-22 · sourced from Reward Models
Related questions: "How should we allocate compute budget at inference time?" · "How do you build domain expertise into general AI models?"

The standard framing of "Does policy entropy collapse limit reasoning performance in RL?" treats entropy collapse as a uniform phenomenon: RL training decreases entropy. Omni-Thinker (2025) reveals it is domain-dependent: structured domains (math, coding) decrease output entropy, while open-ended domains (creative writing, dialogue) increase it.
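To make the dynamic concrete, here is a minimal monitoring sketch, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`; the model handle, logger, and per-domain batches are illustrative placeholders, not artifacts of the paper.

```python
# Hedged sketch: per-domain entropy monitoring during RL training.
# Assumes a causal LM whose forward pass returns logits of shape
# (batch, seq_len, vocab); all names here are illustrative.
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_entropy(model, input_ids: torch.Tensor) -> float:
    """Average next-token entropy (nats) over a batch of sequences."""
    logits = model(input_ids).logits                  # (batch, seq, vocab)
    logp = F.log_softmax(logits.float(), dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)        # (batch, seq)
    return entropy.mean().item()

# Logged per training step, something like:
#   for domain, batch in {"math": math_ids, "creative": creative_ids}.items():
#       logger.log(step, domain, mean_token_entropy(policy, batch))
# Per the note, the "math" curve should fall and the "creative" curve rise.
```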

This is not a minor observation — it makes training order a mechanistic variable, not just a scheduling convenience. If you train creative writing first and structured reasoning second, the structured training will collapse the entropy that creative training expanded, potentially degrading creative capability. If you train structured reasoning first and creative writing second, the creative training preserves and expands the model's expressive range. The ordering effect is predictable from backward transfer (BWT) measurements.
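A sketch of what the BWT measurement itself might look like, using the standard continual-learning definition (accuracy on a task after all training stages, minus accuracy right after that task's own stage). The accuracy matrix, the forgettability scores, and the ordering heuristic are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch: backward transfer (BWT) from an accuracy matrix, plus a
# simple ordering heuristic. acc[i][j] = accuracy on task j measured after
# training stage i completes, for tasks trained in list order.

def backward_transfer(acc: list[list[float]]) -> list[float]:
    """BWT_j = acc[last stage][j] - acc[stage j][j]; negative means forgetting."""
    T = len(acc)
    return [acc[T - 1][j] - acc[j][j] for j in range(T)]

def schedule_by_forgettability(tasks: list[str],
                               forgettability: dict[str, float]) -> list[str]:
    """Heuristic: train the most easily forgotten tasks last, so no later
    stage can erase them. Scores could come from short pairwise pilot runs
    that measure BWT; that estimation step is assumed here."""
    return sorted(tasks, key=lambda t: forgettability[t])

# Example with made-up scores (higher = more easily forgotten):
order = schedule_by_forgettability(
    ["creative", "math", "code", "dialogue"],
    {"math": 0.1, "code": 0.2, "dialogue": 0.6, "creative": 0.9},
)
# -> ['math', 'code', 'dialogue', 'creative']: structured first, creative last.
```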

Omni-Thinker uses BWT-guided scheduling: order tasks so that later training stages cause minimal negative backward transfer to the tasks already learned. The approach uses hybrid rewards, combining verifiable rule-based checks for deterministic domains with preference-based LLM-as-Judge scores for subjective ones, enabling unified training across domain types within a single policy. Short-form QA tasks are conditioned on distractor options so that random guessing cannot hack the reward.
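A minimal sketch of the hybrid-reward routing, assuming three placeholder hooks (`verify_math`, `run_unit_tests`, `judge_score`) that stand in for a real answer checker, a sandboxed test runner, and a judge-model call; none of these names come from the paper.

```python
# Hedged sketch: verifiable rewards for deterministic domains, judge-based
# preference rewards for subjective ones. All hooks below are placeholders.

def verify_math(response: str, reference: str) -> bool:
    # Placeholder: a real checker would normalize and compare expressions.
    return response.strip() == reference.strip()

def run_unit_tests(response: str, tests) -> float:
    # Placeholder: fraction of test callables passing on the response;
    # a real version would execute generated code in a sandbox.
    passed = sum(1 for test in tests if test(response))
    return passed / max(len(tests), 1)

def judge_score(prompt: str, response: str) -> float:
    # Placeholder: call an LLM-as-Judge and map its rating into [0, 1].
    raise NotImplementedError("wire up a judge model here")

def hybrid_reward(domain: str, prompt: str, response: str, reference=None) -> float:
    """Route to a verifiable or preference-based reward by domain."""
    if domain == "math":
        return 1.0 if verify_math(response, reference) else 0.0
    if domain == "code":
        return run_unit_tests(response, reference)
    return judge_score(prompt, response)  # creative writing, dialogue, etc.
```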

The gains are substantial: 6.2% over joint multi-task training, 12.4% over model merging. The accuracy of final multi-task models is well-predicted by forgettability rankings, even under simplifying assumptions — suggesting BWT-guided scheduling has principled theoretical grounding.

This extends "Does gradually tightening token budgets beat fixed budget training?" from temporal budgets to task ordering: the dimension that matters for multi-task RL is not just how much compute each task gets, but which tasks come first. It also refines the picture of entropy collapse: collapse is not a bug to fix everywhere; in structured domains it reflects desirable precision. The problem arises when structured-domain entropy collapse propagates and damages open-ended capabilities.


Source: Reward Models — Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling (arXiv 2507.14783)
