MLLM-CBench: A Comprehensive Benchmark for Continual Instruction Tuning of Multimodal LLMs with Chain-of-Thought Reasoning Analysis

Paper · arXiv 2508.08275 · Published July 31, 2025

However, real-world deployment demands continuous adaptation to evolving instructions and domain requirements—a paradigm known as continual instruction tuning (He et al. 2023a), where the model incrementally learns from new tasks while retaining prior capabilities.

While significant progress has been made in continual instruction tuning for Large Language Models (LLMs) (Zheng et al. 2025a), its multimodal counterpart remains underexplored. The absence of a rigorous benchmark further impedes progress: existing benchmarks for continual instruction tuning of MLLMs (e.g., EMT (Jia et al. 2025), CITB (He et al. 2023b), CoIN (Chen et al. 2024a)) exhibit several critical limitations. 1) Superficial Evaluation Paradigms: Prevailing benchmarks prioritize final-answer correctness while neglecting granular analysis of the reasoning process, hindering an in-depth understanding of the causes of catastrophic forgetting in MLLMs (Luo et al. 2023). Although CoIN (Chen et al. 2024a) implicitly estimates the forgetting of reasoning knowledge, the interpretability of its evaluation metric remains limited. 2) Limited Exploration of Training Algorithms and Paradigms: Existing works predominantly quantify catastrophic forgetting under sequential fine-tuning, while overlooking systematic investigation of continual learning algorithms' efficacy, which limits their practical impact. Furthermore, alternative training paradigms such as reinforcement learning (RL), which may offer a better trade-off between stability and plasticity, remain largely unexplored.
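To make the evaluation setting concrete, a minimal sketch of how catastrophic forgetting is commonly quantified in continual learning follows. This uses the standard accuracy-matrix formulation (average final accuracy and average forgetting); it is an illustrative convention, not necessarily the exact metric defined by MLLM-CBench or the cited benchmarks.

```python
def forgetting_metrics(acc):
    """Compute common continual-learning metrics from an accuracy matrix.

    acc[i][j] is the accuracy on task j measured after sequentially
    training on tasks 0..i (so acc is a T x T lower-triangular-style matrix).
    Returns (average final accuracy, average forgetting), where forgetting
    for task j is its best accuracy ever achieved minus its final accuracy.
    """
    T = len(acc)
    final = acc[T - 1]                      # accuracies after the last task
    avg_acc = sum(final) / T
    # Forgetting is only defined for tasks seen before the final one.
    forgetting = [
        max(acc[i][j] for i in range(j, T - 1)) - final[j]
        for j in range(T - 1)
    ]
    avg_forgetting = sum(forgetting) / len(forgetting) if forgetting else 0.0
    return avg_acc, avg_forgetting

# Hypothetical 3-task run: task 0 drops from 0.9 to 0.6, task 1 from 0.8 to 0.7.
avg_acc, avg_fgt = forgetting_metrics(
    [[0.9, 0.0, 0.0],
     [0.7, 0.8, 0.0],
     [0.6, 0.7, 0.9]]
)
```

A high average accuracy with near-zero average forgetting indicates a good stability-plasticity trade-off; the paper's critique is that such aggregate, answer-level numbers alone cannot explain *why* reasoning degrades across tasks.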