Reasoning and Learning Architectures

Can optimizing attention patterns improve multimodal RL better than optimizing tokens?

Standard RL training optimizes token outputs in multimodal models, but the real bottleneck may be where the model attends to visual information. Does steering attention directly outperform indirect optimization through final outputs?

Note · 2026-05-18 · sourced from LLM Architecture

Standard post-training with RL improves reasoning in language models by optimizing token-level outputs. Extending the same paradigm to multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception tasks and can even degrade performance. The diagnosis in Reinforced Attention Learning (RAL) is that next-token prediction is the wrong policy objective when the actual bottleneck is information allocation in attention.

The mechanism: in MLLM architectures, visual inputs are encoded as tokens and projected into the textual embedding space. Accurate visual question-answering requires the model to precisely identify and attend to task-relevant visual information. This identification is the work of the attention mechanism, which assigns high weights to salient multimodal tokens. Standard RLHF optimizes the result (the output token sequence) rather than the process (the internal information allocation). The policy gradient reaches the attention weights only indirectly, through their effect on token likelihoods, so the place where the real decision happens is never targeted directly.
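As a rough illustration of what "information allocation" means here, the sketch below measures how much of a single query's attention mass lands on visual tokens. The names `softmax` and `visual_attention_mass` and the raw-score input are hypothetical and not taken from the source; a real MLLM would have one such distribution per head and per layer.

```python
import math

def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def visual_attention_mass(scores, visual_positions):
    """Fraction of attention that falls on the given visual-token positions.

    `scores` are raw (pre-softmax) attention scores for one query over all
    context tokens; `visual_positions` are the indices of the visual tokens.
    Both are illustrative inputs, not an API from the source.
    """
    weights = softmax(scores)
    return sum(weights[i] for i in visual_positions)

# Uniform scores over four tokens, two of them visual: half the mass is visual.
print(visual_attention_mass([0.0, 0.0, 0.0, 0.0], [0, 1]))  # 0.5
```

A quantity like this is only a diagnostic: it shows where attention goes, not whether that allocation is the right one for the task.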

RAL reformulates the post-training policy to operate directly on the attention distribution during generation. When a response receives high reward, the algorithm reinforces the underlying attention pattern by minimizing the divergence between the current attention and a reference. When reward is low, it pushes the current attention away from the sub-optimal pattern by increasing that divergence. Attention becomes the policy object; tokens become a downstream observable.
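The update rule described above can be sketched as a reward-signed divergence loss. This is a minimal sketch under stated assumptions, not the method from the source: it assumes KL divergence as the divergence measure, a signed scalar advantage (reward minus a baseline) as the weight, and plain Python lists as attention distributions. The names `kl_divergence` and `attention_policy_loss` are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def attention_policy_loss(current_attn, reference_attn, advantage):
    """Reward-signed divergence between current and reference attention.

    Assumption: `advantage` is positive for high-reward responses and
    negative for low-reward ones. Minimizing the returned value then
    shrinks the divergence when advantage > 0 and grows it when
    advantage < 0.
    """
    return advantage * kl_divergence(reference_attn, current_attn)

reference = [0.5, 0.5]
current = [0.9, 0.1]
high = attention_policy_loss(current, reference, advantage=1.0)
low = attention_policy_loss(current, reference, advantage=-1.0)
print(high > 0, low < 0)  # True True
```

In a real training loop the loss would be computed on differentiable attention tensors and averaged over heads, layers, and generation steps; those choices are left open here because the source does not specify them.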

This is structurally distinct from RLHF. RLHF is outcome-based RL where the gradient flows from a scalar reward through the token-generation chain. RAL is process-aware RL where the gradient flows directly to attention distributions, treating the information-allocation step as a first-class policy. The two are not interchangeable — they reinforce different aspects of the model's behavior.
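
One way to write the contrast down, as a sketch under assumptions rather than the source's exact formulation: let $A$ be an advantage derived from the scalar reward, $\pi_\theta$ the token policy, $\alpha_\theta$ the current attention distribution, $\alpha_{\mathrm{ref}}$ the reference, and $D$ a divergence such as KL. All symbols are illustrative.

```latex
\nabla_\theta J_{\text{token}}
  = \mathbb{E}\Big[ A \sum_t \nabla_\theta \log \pi_\theta(y_t \mid x, y_{<t}) \Big]
\qquad
\mathcal{L}_{\text{attn}}
  = \mathbb{E}\big[ A \cdot D(\alpha_{\mathrm{ref}} \,\|\, \alpha_\theta) \big]
```

In the first expression, attention is shaped only through its effect on token likelihoods; in the second, the attention distribution itself appears in the loss.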

The pattern generalizes. Wherever the bottleneck on a task is internal to the model rather than at the output, optimizing the output is a leaky channel for steering the bottleneck. Attention is the case here, but in principle gating decisions in MoE, retrieval choices in RAG, and tool selection in agents are all candidates for direct policy optimization rather than optimization mediated through final outputs.

Original note title

attention distributions are first-class policy optimization targets for multimodal RL — optimizing where to attend beats optimizing what to generate