Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling

Paper · arXiv 2507.14783 · Published July 20, 2025
Reward Models · Reinforcement Learning

The pursuit of general-purpose artificial intelligence depends on large language models (LLMs) that can handle both structured reasoning and open-ended generation. We present OMNI-THINKER, a unified reinforcement learning (RL) framework that scales LLMs across diverse tasks by combining hybrid rewards with backward-transfer–guided scheduling. Hybrid rewards integrate rule-based verifiable signals with preference-based evaluations from an LLM-as-a-Judge, enabling learning in both deterministic and subjective domains. Our scheduler orders tasks according to accuracy backward transfer (BWT), reducing forgetting and improving multi-task performance. Experiments across four domains show gains of 6.2% over joint training and 12.4% over model merging. Moreover, we demonstrate that simple assumptions on accuracy transfer yield accurate predictions of curriculum outcomes, with entropy dynamics explaining deviations due to generative tasks. These findings underscore the importance of BWT-aware scheduling and hybrid supervision for scaling RL-based post-training toward general-purpose LLMs.

We address this challenge with OMNI-THINKER, a unified RL framework that enables LLMs to learn from both rule-based and generative supervision under a single policy. Building on Reinforcement Learning with Verifiable Rewards (RLVR), our method integrates symbolic verifiers with LLM-as-a-Judge evaluations (Zheng et al., 2023; Zhang et al., 2025) to handle subjective tasks. Our curriculum is forgetting-aware: it is guided by backward transfer (BWT), computed on a normalized, task-specific test metric. Ordering task training according to this signal yields effective curricula across heterogeneous domains. We show that the final accuracy of the model after curriculum learning is well predicted by the forgettability ranking, even under simplifying assumptions. Empirically, we observe complementary entropy dynamics: fine-tuning on creative writing tends to increase the model’s output entropy, whereas training on verifier-supervised, structured tasks tends to decrease it. This trend is consistent with our BWT-guided choice to train structured tasks before open-ended ones. Across four domains, OMNI-THINKER improves generalization while reducing forgetting, with average gains of 6.2% over joint multi-task training and 12.4% over model merging.

Our key contributions are threefold. (1) We present OMNI-THINKER, a unified framework that trains a single policy across four diverse domains using hybrid verifiable and preference-based rewards. (2) We develop a forgetting-aware curriculum that orders tasks to maximize backward transfer (BWT) on task-specific test performance, reducing forgetting and outperforming joint multi-task training and model merging. (3) We empirically analyze training dynamics through the lens of entropy, revealing that structured domains (math, coding) systematically decrease output entropy while open-ended domains (creative writing) increase it, thereby providing an explanatory link between entropy evolution and the effectiveness of BWT-guided curricula.
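
As a rough illustration of the scheduling mechanism, the sketch below scores a task ordering with the standard continual-learning BWT definition and selects the ordering with the highest estimated BWT. The `estimate_bwt` callback is a hypothetical stand-in (e.g., fit from short pilot runs on task pairs), not part of the paper.

```python
from itertools import permutations

def backward_transfer(final_perf, post_task_perf):
    """Average backward transfer over the first T-1 tasks:
    BWT = mean_i [ R(T, i) - R(i, i) ], where R(j, i) is the normalized
    test metric on task i after training through task j. Negative values
    indicate forgetting of earlier tasks."""
    T = len(post_task_perf)
    if T < 2:
        return 0.0
    return sum(final_perf[i] - post_task_perf[i] for i in range(T - 1)) / (T - 1)

def best_curriculum(tasks, estimate_bwt):
    """Return the task ordering with the highest estimated BWT (least forgetting).
    `estimate_bwt(order)` is a caller-supplied estimator; exhaustive search is
    cheap for a handful of tasks (24 orderings for four domains)."""
    return max(permutations(tasks), key=estimate_bwt)
```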

Short-Form Open-Ended Supervision. For language tasks with known or extractable ground-truth answers, such as general question answering (QA), we reformulate queries into open-ended prompts and incorporate distractor responses (LLM-generated plausible but incorrect answers) into the context. Instead of labeling options, we prompt the model to reason using the ... format and to output answers within ... tags. Responses are evaluated with a binary reward based on string matching or set membership against reference answers, thereby encouraging semantic grounding and mitigating shallow pattern memorization. We find that conditioning the LLM on a diverse set of candidate options (one correct answer plus multiple distractors) is key to steadily improving general-domain reasoning: compared to directly prompting the model to generate open-ended answers without the augmented context, it reduces susceptibility to random guessing and reward hacking.
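
A minimal sketch of this short-form setup, assuming an `<answer>` tag convention and a plain string-match check; the tag name and prompt wording are illustrative placeholders, not the paper's exact format.

```python
import random
import re

def build_prompt(question, correct, distractors):
    """Open-ended prompt augmented with a shuffled candidate set: one correct
    answer plus LLM-generated distractors, so option order carries no signal."""
    candidates = [correct] + list(distractors)
    random.shuffle(candidates)
    return (
        f"{question}\n"
        f"Possible answers (exactly one is correct): {', '.join(candidates)}\n"
        "Reason step by step, then give your final answer inside <answer> tags."
    )

def short_form_reward(response, references):
    """Binary reward: 1.0 if the extracted answer matches any reference answer
    (string match / set membership), else 0.0."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL | re.IGNORECASE)
    if m is None:
        return 0.0  # malformed output: no extractable answer
    answer = m.group(1).strip().lower()
    return 1.0 if answer in {r.strip().lower() for r in references} else 0.0
```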

Long-Form Open-Ended Supervision. For subjective tasks lacking ground truth (e.g., dialogue, writing), we use an LLM-as-a-Judge (Chen et al., 2025) to assign a scalar reward based on rubric-aligned pairwise preferences between candidate outputs. This enables learning in domains where symbolic correctness is insufficient or intractable. This prompt-based approach leverages recent advances in the general reasoning capabilities of LLMs, using generated chains of thought to elicit a ternary reward signal (preferred, tie, or dispreferred) without requiring large-scale preference data collection or reward model training.
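
A minimal sketch of how such a judge-based reward could be wired up; `judge_fn` and the ternary-to-scalar mapping below are assumptions for illustration, not the paper's exact interface or values.

```python
def judge_reward(prompt, candidate, reference_output, judge_fn):
    """Ternary preference reward from a prompted LLM judge.
    `judge_fn` (hypothetical callable) compares two responses under a rubric
    and returns 'preferred', 'tie', or 'dispreferred' for the candidate;
    the scalar mapping here is one possible choice."""
    verdict = judge_fn(prompt, candidate, reference_output)
    return {"preferred": 1.0, "tie": 0.5, "dispreferred": 0.0}[verdict]
```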

Together, these components form a unified hybrid reward scheme: verifiable rewards ensure correctness where possible, and preference-based signals cover subjective domains. This design enables reinforcement learning to scale across diverse tasks, from reasoning to open-ended generation.
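
One way to picture the combined scheme is a per-task dispatch between the two reward sources; the task labels and callables below are illustrative, not the paper's interface.

```python
from typing import Callable

# Illustrative task groupings; the paper's exact domain labels may differ.
VERIFIABLE = {"math", "code", "short_qa"}
SUBJECTIVE = {"writing", "dialogue"}

def hybrid_reward(task: str, response: str,
                  verify: Callable[[str], float],
                  judge: Callable[[str], float]) -> float:
    """Dispatch each rollout to the matching supervision signal:
    `verify` is a rule-based checker (string match, unit tests, symbolic
    verifier); `judge` is a preference score from an LLM-as-a-Judge."""
    if task in VERIFIABLE:
        return verify(response)
    if task in SUBJECTIVE:
        return judge(response)
    raise ValueError(f"unknown task type: {task}")
```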