Reinforced Attention Learning

Paper · arXiv 2602.04884
LLM Architecture · Novel LLM Architectures · Inference-Time Scaling

Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Our results position attention policies as a principled and general alternative for multimodal post-training.

We attribute the limited perception gains of rationale-based RL to the insufficiency of next-token prediction as the fundamental policy objective in MLLM post-training. In typical MLLM architectures, visual inputs are encoded as tokens and projected into the textual embedding space to serve as context for generation. Answering fine-grained questions accurately requires the model to identify and attend to the task-relevant visual information. This process is governed by the Transformer's attention mechanism, which must learn to assign high weights to salient multimodal tokens. Standard RLHF, however, optimizes the result (the generated token) rather than the process (the internal allocation of information).
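To make "assigning high weights to salient multimodal tokens" concrete, the following minimal sketch computes how much of one query's attention falls on visual tokens. It is an illustration only: the score values, the position layout, and the helper name `visual_attention_mass` are hypothetical and do not come from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def visual_attention_mass(scores, visual_positions):
    """Fraction of one query's attention that lands on visual-token positions."""
    weights = softmax(scores)
    return sum(weights[i] for i in visual_positions)

# Hypothetical context: positions 0-3 hold projected image tokens,
# positions 4-5 hold text tokens. The scores are made-up values.
scores = [0.1, 2.0, 0.3, 0.2, 1.0, 0.5]
mass = visual_attention_mass(scores, visual_positions=range(4))
print(f"attention mass on visual tokens: {mass:.3f}")
```

A token-level objective never inspects this quantity directly. It can only influence it indirectly, through the gradient of the output token likelihood.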

Inspired by this observation, we reformulate the post-training policy for MLLMs to operate directly on the attention distribution during generation. Unlike token-level policy-gradient methods, RAL treats the attention pattern itself as the policy. When a response receives a high reward, the algorithm reinforces the attention distribution that produced it by minimizing the divergence between the current attention policy and that reference distribution. Conversely, when a response receives a low reward, the algorithm pushes the current policy away from the sub-optimal pattern by increasing that divergence. By shifting the optimization target from token likelihood to attention-based allocation, RAL fine-tunes MLLMs more directly for multimodal alignment.
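The reward-dependent push and pull described above can be sketched as an advantage-weighted divergence. This is a hedged illustration, not the paper's exact objective. It assumes a Jensen-Shannon divergence, group-normalized advantages, and a single attention distribution per response, and the function names are hypothetical.

```python
import math

def _kl(p, q):
    """KL(p || q); q must be positive wherever p is (true for the JS mixture)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence, bounded above by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def group_advantages(rewards):
    """Assumption: advantage = reward standardized within the sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]

def attention_policy_loss(current_attn, sampled_attn, rewards):
    """Mean of advantage * divergence over a group of sampled responses.

    Minimizing this shrinks the divergence for above-average responses
    (positive advantage) and grows it for below-average ones.
    """
    advs = group_advantages(rewards)
    terms = [a * js_divergence(ref, cur)
             for a, ref, cur in zip(advs, sampled_attn, current_attn)]
    return sum(terms) / len(terms)

# Toy group: response 0 was rewarded, response 1 was not.
sampled = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
rewards = [1.0, 0.0]
before = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
after = [[0.7, 0.2, 0.1], [0.4, 0.3, 0.3]]  # nearer the good pattern, farther from the bad
print(attention_policy_loss(before, sampled, rewards))
print(attention_policy_loss(after, sampled, rewards))
```

A bounded divergence is used here so that the negative-advantage term cannot grow without limit. In actual training the current attention would be a differentiable tensor, and the loss would be aggregated over layers, heads, and generation steps, none of which this sketch models.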

The efficacy of optimizing attention distributions extends naturally to on-policy distillation. In on-policy distillation, the student generates responses under its own policy and receives supervision from teacher evaluations along these trajectories. Compared to offline knowledge distillation on static datasets, this mitigates exposure bias and better aligns the student's generation distribution with deployment-time behavior. While traditional distillation focuses on token-level probability alignment, we propose a dual-distillation approach that transfers knowledge through both token and attention distribution alignment. Our experiments indicate that adding attention distillation provides significant additional performance gains.
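A dual-distillation objective of this kind can be sketched as a token-level term plus a weighted attention-level term. The sketch below is an assumption-laden illustration rather than the paper's formulation. It assumes a forward KL(teacher || student) for both terms, teacher and student attention already reduced to one distribution per position over the same student-generated context, and a hypothetical mixing weight `attn_weight`.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def dual_distillation_loss(teacher_tok, student_tok,
                           teacher_attn, student_attn, attn_weight=1.0):
    """Return (total, token_term, attn_term), each averaged over positions.

    Every argument is a list with one probability distribution per position
    of the student-generated trajectory.
    """
    n = len(teacher_tok)
    token_term = sum(kl_divergence(t, s)
                     for t, s in zip(teacher_tok, student_tok)) / n
    attn_term = sum(kl_divergence(t, s)
                    for t, s in zip(teacher_attn, student_attn)) / n
    return token_term + attn_weight * attn_term, token_term, attn_term

# Toy trajectory with two positions, a 3-token vocabulary, and 4 context slots.
teacher_tok = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]]
student_tok = [[0.6, 0.3, 0.1], [0.3, 0.5, 0.2]]
teacher_attn = [[0.1, 0.6, 0.2, 0.1], [0.5, 0.2, 0.2, 0.1]]
student_attn = [[0.3, 0.3, 0.2, 0.2], [0.4, 0.2, 0.2, 0.2]]

total, tok, attn = dual_distillation_loss(
    teacher_tok, student_tok, teacher_attn, student_attn, attn_weight=0.5)
print(f"total={total:.4f} token={tok:.4f} attention={attn:.4f}")
```

Setting `attn_weight=0` recovers token-only distillation, which makes it easy to ablate the attention term in a toy setting.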

We introduced Reinforced Attention Learning, an MLLM post-training paradigm that shifts optimization from the output token distribution to internal attention distributions. By treating attention as a policy, RAL directly reinforces visual grounding and perceptual focus, addressing a fundamental limitation of outcome-based RL methods that neglect the underlying cross-modal reasoning process. Our results support the hypothesis that supervising internal information allocation yields a more reliable and generalizable training signal than next-token gradients alone. Ultimately, this work establishes attention distributions as a first-class optimization target for multimodal alignment, offering a principled, process-aware alternative to standard RLHF.