
Reinforced Attention Learning

Authors: Bangzheng Li, Jianmo Ni, Chen Qu, Ian Miao, Liu Yang, Xingyu Fu, Muhao Chen, Derek Zhiyuan Cheng

Published: February 4, 2026
Word Count: 6,124

Enhance MLLM perception with Reinforced Attention Learning.

Abstract

Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.
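For intuition, the sketch below illustrates the two ideas from the abstract in PyTorch: a REINFORCE-style policy-gradient update over a toy attention "policy" (where the action is *where to attend*), and an on-policy attention-distillation loss that matches a student's attention distribution to a teacher's. This is a minimal sketch under assumptions, not the paper's implementation: `AttnPolicy`, `ral_step`, `attn_distill_loss`, and `reward_fn` are hypothetical names, and RAL operates on an MLLM's internal attention heads rather than a single toy head.

```python
# Illustrative sketch only. A REINFORCE-style update over a toy attention
# "policy", plus an on-policy attention-distillation KL term. All names,
# shapes, and the reward function are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttnPolicy(nn.Module):
    """Scores N candidate regions; the softmax over scores is the attention 'policy'."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)

    def forward(self, query_vec: torch.Tensor, region_feats: torch.Tensor):
        # query_vec: (dim,), region_feats: (N, dim) -> Categorical over N regions
        logits = region_feats @ self.query(query_vec)
        return torch.distributions.Categorical(logits=logits)


def ral_step(policy, optimizer, query_vec, region_feats, reward_fn, baseline=0.0):
    """One policy-gradient step: sample WHERE to attend, then reinforce
    attention placements that led to a rewarded downstream outcome."""
    dist = policy(query_vec, region_feats)
    region = dist.sample()                         # action = attended region
    reward = reward_fn(region)                     # e.g., 1.0 if the answer is correct
    loss = -(reward - baseline) * dist.log_prob(region)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward


def attn_distill_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor):
    """Attention distillation as a KL term: match the student's attention
    distribution to the teacher's. Inputs are (batch, N) attention logits."""
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
```

In practice the raw reward would typically be replaced with a group-relative advantage (as in GRPO-style baselines), and the distillation term would be computed on student-generated rollouts to keep it on-policy.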

Key Takeaways

  1. RAL optimizes internal attention distributions for better perception.

  2. Post-training with verbose output rationales (e.g., GRPO) often fails to improve perceptual accuracy.

  3. RAL enhances the model's ability to focus on relevant information in complex multimodal inputs.

Limitations

  • RAL may require more computational resources than standard output-level RL post-training such as GRPO.

  • The effectiveness of RAL in diverse real-world scenarios is yet to be fully explored.

Keywords

Reinforcement Learning, Large Language Models, Multimodal LLMs, policy-gradient framework, attention distributions, test-time scaling, GRPO, On-Policy Attention Distillation, cross-modal alignment, knowledge distillation
