What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis

Xirui Li, Ming Li, Tianyi Zhou
Published February 12, 2026 · 3 authors · 10,477 words

RL improves vision-language benchmarks by shifting attention, not necessarily improving vision or reasoning.

Abstract

Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benchmark gains conflate multiple factors, making it difficult to attribute improvements to specific skills. To bridge this gap, we propose a Frankenstein-style analysis framework comprising: (i) functional localization via causal probing; (ii) update characterization via parameter comparison; and (iii) transferability tests via model merging. Across the models we study, RL does not uniformly improve visual perception. Instead, it induces a consistent inference-time shift primarily in mid-to-late layers, and these mid-to-late refinements are both transferable (via merging) and necessary (via freezing) for RL gains. Overall, our results suggest that RL's reliable contribution to visual reasoning is not a uniform enhancement of visual perception but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance, highlighting the limitations of benchmark-only evaluation for understanding multimodal reasoning improvements.
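The transferability test can be pictured as a layer-wise "Frankenstein" merge: take the mid-to-late layers from the RL checkpoint and everything else from the IN checkpoint, then evaluate the hybrid. A minimal sketch, assuming flat parameter dicts with a hypothetical `layers.<i>.<name>` naming scheme (real checkpoint key formats vary by architecture):

```python
def frankenstein_merge(in_params, rl_params, swap_layers):
    """Build a hybrid model: RL weights for selected layers, IN weights elsewhere.

    in_params / rl_params: flat dicts mapping parameter names (e.g.
    'layers.12.attn.w') to weights. `swap_layers` is the set of transformer
    layer indices to take from the RL checkpoint. Naming scheme is hypothetical.
    """
    merged = {}
    for name, weight in in_params.items():
        layer = _layer_index(name)
        merged[name] = rl_params[name] if layer in swap_layers else weight
    return merged


def _layer_index(name):
    """Extract the layer index from a parameter name, or -1 for
    non-layer parameters (embeddings, output head), which stay from IN."""
    parts = name.split(".")
    if parts[0] == "layers":
        return int(parts[1])
    return -1
```

Evaluating hybrids built with different `swap_layers` ranges on a reasoning benchmark then attributes gains to specific layer ranges, which is how merging can show that mid-to-late refinements are the transferable part.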

Key Takeaways

  1. RL improves vision-language model benchmarks, but it is unclear whether vision, reasoning, or their combination actually improved.

  2. Fine-grained ability metrics show inconsistent improvements across models, while attention to vision tokens increases consistently.

  3. Aggregate benchmarks conflate multiple internal changes, so component-level analysis is needed to understand what RL actually optimizes in vision-language models.
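The attention shift in takeaway 2 can be quantified as the fraction of attention mass a query position places on vision tokens, averaged over heads. A toy sketch, assuming attention weights for a single query position are given per head (all names here are hypothetical, not the paper's code):

```python
def vision_attention_mass(attn_rows, vision_mask):
    """Average fraction of attention landing on vision tokens.

    attn_rows: one list of attention weights per head, for a single query
    position, each summing over all key positions.
    vision_mask: list of bools marking which key positions are vision tokens.
    """
    per_head = []
    for row in attn_rows:
        total = sum(row)
        on_vision = sum(w for w, is_vis in zip(row, vision_mask) if is_vis)
        per_head.append(on_vision / total)
    return sum(per_head) / len(per_head)
```

Comparing this statistic between the IN and RL checkpoints on the same inputs is one way to make "attention to vision tokens increases" concrete.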

Limitations

  • Vision ability cannot be observed directly; the authors infer it through black-image substitution and text-description replacement tests.

  • Analysis focuses on three specific training recipes, limiting generalizability of findings to other RL approaches or vision-language architectures.
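The black-image substitution test mentioned above can be framed as a generic input-substitution probe: score the model on the original inputs, score it again with the image replaced (e.g., by an all-black image or a text description), and read the accuracy drop as reliance on the visual input. A minimal sketch with a hypothetical `model_fn(image, question) -> answer` interface:

```python
def substitution_probe(model_fn, examples, substitute):
    """Accuracy with real inputs vs. substituted inputs.

    model_fn: callable (image, question) -> answer (hypothetical interface).
    examples: list of (image, question, gold_answer) tuples.
    substitute: callable mapping an image to its replacement
    (e.g., an all-black image of the same size).
    Returns (real_accuracy, substituted_accuracy); the gap between the two
    estimates how much the model relies on the visual input.
    """
    n = len(examples)
    real = sum(model_fn(img, q) == gold for img, q, gold in examples) / n
    subbed = sum(model_fn(substitute(img), q) == gold for img, q, gold in examples) / n
    return real, subbed
```

The same harness covers both probes in the limitation above: `substitute` can blank the image or swap it for a text description fed through a different input path.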
