Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does training order reshape how models handle different task types?

Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized.

Note · 2026-02-22 · sourced from Reward Models
Related questions: "How should we allocate compute budget at inference time?" · "How do you build domain expertise into general AI models?"

The standard framing of "Does policy entropy collapse limit reasoning performance in RL?" treats entropy collapse as a uniform phenomenon: RL training decreases entropy. Omni-Thinker (2025) reveals it is domain-dependent: structured domains (math, coding) decrease output entropy, while open-ended domains (creative writing, dialogue) increase it.
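To make the dynamic concrete, here is a minimal monitoring sketch, assuming a Hugging Face-style causal LM whose forward pass returns `.logits`; the model handle, logger, and per-domain batches are illustrative placeholders, not artifacts of the paper.

```python
# Hedged sketch: per-domain entropy monitoring during RL training.
# Assumes a causal LM whose forward pass returns logits of shape
# (batch, seq_len, vocab); all names here are illustrative.
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_token_entropy(model, input_ids: torch.Tensor) -> float:
    """Average next-token entropy (nats) over a batch of sequences."""
    logits = model(input_ids).logits                  # (batch, seq, vocab)
    logp = F.log_softmax(logits.float(), dim=-1)
    entropy = -(logp.exp() * logp).sum(dim=-1)        # (batch, seq)
    return entropy.mean().item()

# Logged per training step, something like:
#   for domain, batch in {"math": math_ids, "creative": creative_ids}.items():
#       logger.log(step, domain, mean_token_entropy(policy, batch))
# Per the note, the "math" curve should fall and the "creative" curve rise.
```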

This is not a minor observation — it makes training order a mechanistic variable, not just a scheduling convenience. If you train creative writing first and structured reasoning second, the structured training will collapse the entropy that creative training expanded, potentially degrading creative capability. If you train structured reasoning first and creative writing second, the creative training preserves and expands the model's expressive range. The ordering effect is predictable from backward transfer (BWT) measurements.
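A sketch of what the BWT measurement itself might look like, using the standard continual-learning definition (accuracy on a task after all training stages, minus accuracy right after that task's own stage). The accuracy matrix, the forgettability scores, and the ordering heuristic are assumptions for illustration, not the paper's exact procedure.

```python
# Hedged sketch: backward transfer (BWT) from an accuracy matrix, plus a
# simple ordering heuristic. acc[i][j] = accuracy on task j measured after
# training stage i completes, for tasks trained in list order.

def backward_transfer(acc: list[list[float]]) -> list[float]:
    """BWT_j = acc[last stage][j] - acc[stage j][j]; negative means forgetting."""
    T = len(acc)
    return [acc[T - 1][j] - acc[j][j] for j in range(T)]

def schedule_by_forgettability(tasks: list[str],
                               forgettability: dict[str, float]) -> list[str]:
    """Heuristic: train the most easily forgotten tasks last, so no later
    stage can erase them. Scores could come from short pairwise pilot runs
    that measure BWT; that estimation step is assumed here."""
    return sorted(tasks, key=lambda t: forgettability[t])

# Example with made-up scores (higher = more easily forgotten):
order = schedule_by_forgettability(
    ["creative", "math", "code", "dialogue"],
    {"math": 0.1, "code": 0.2, "dialogue": 0.6, "creative": 0.9},
)
# -> ['math', 'code', 'dialogue', 'creative']: structured first, creative last.
```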

Omni-Thinker uses BWT-guided scheduling: order tasks so that later training stages cause minimal negative backward transfer to the tasks already learned. The approach uses hybrid rewards, combining verifiable rule-based checks for deterministic domains with preference-based LLM-as-Judge scores for subjective ones, enabling unified training across domain types within a single policy. Short-form QA tasks are conditioned on distractor options so that random guessing cannot hack the reward.
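A minimal sketch of the hybrid-reward routing, assuming three placeholder hooks (`verify_math`, `run_unit_tests`, `judge_score`) that stand in for a real answer checker, a sandboxed test runner, and a judge-model call; none of these names come from the paper.

```python
# Hedged sketch: verifiable rewards for deterministic domains, judge-based
# preference rewards for subjective ones. All hooks below are placeholders.

def verify_math(response: str, reference: str) -> bool:
    # Placeholder: a real checker would normalize and compare expressions.
    return response.strip() == reference.strip()

def run_unit_tests(response: str, tests) -> float:
    # Placeholder: fraction of test callables passing on the response;
    # a real version would execute generated code in a sandbox.
    passed = sum(1 for test in tests if test(response))
    return passed / max(len(tests), 1)

def judge_score(prompt: str, response: str) -> float:
    # Placeholder: call an LLM-as-Judge and map its rating into [0, 1].
    raise NotImplementedError("wire up a judge model here")

def hybrid_reward(domain: str, prompt: str, response: str, reference=None) -> float:
    """Route to a verifiable or preference-based reward by domain."""
    if domain == "math":
        return 1.0 if verify_math(response, reference) else 0.0
    if domain == "code":
        return run_unit_tests(response, reference)
    return judge_score(prompt, response)  # creative writing, dialogue, etc.
```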

The gains are substantial: 6.2% over joint multi-task training, 12.4% over model merging. The accuracy of final multi-task models is well-predicted by forgettability rankings, even under simplifying assumptions — suggesting BWT-guided scheduling has principled theoretical grounding.

This extends "Does gradually tightening token budgets beat fixed budget training?" from temporal budgets to task ordering: the dimension that matters for multi-task RL is not just how much compute each task gets, but which tasks come first. It also refines the picture of entropy collapse: collapse is not a bug to fix everywhere; in structured domains it reflects desirable precision. The problem arises when structured-domain entropy collapse propagates and damages open-ended capabilities.


Source: Reward Models — Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling (arXiv 2507.14783)
