
AgilePruner: An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

Changwoo Baek, Jouwon Song, Sohyeon Kim, Kyeongbo Kong
Published: March 1, 2026
Authors: 4
Word Count: 13,313

Empirical study reveals diversity-based token pruning increases hallucinations; image complexity should guide pruning strategy selection.

Abstract

Large Vision-Language Models (LVLMs) have adopted visual token pruning strategies to mitigate the substantial computational overhead incurred by extensive visual token sequences. While prior works primarily focus on either attention-based or diversity-based pruning methods, an in-depth analysis of these approaches' characteristics and limitations remains largely unexplored. In this work, we conduct a thorough empirical analysis using effective rank (erank) as a measure of feature diversity and attention score entropy to investigate visual token processing mechanisms and analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Our erank-based quantitative analysis shows that many diversity-oriented pruning methods preserve substantially less feature diversity than intended; moreover, analysis using the CHAIR dataset reveals that the diversity they do retain is closely tied to increased hallucination frequency compared to attention-based pruning. (2) We further observe that attention-based approaches are more effective on simple images where visual evidence is concentrated, while diversity-based methods better handle complex images with distributed features. Building on these empirical insights, we show that incorporating image-aware adjustments into existing hybrid pruning strategies consistently improves their performance. We also provide a minimal instantiation of our empirical findings through a simple adaptive pruning mechanism, which achieves strong and reliable performance across standard benchmarks as well as hallucination-specific evaluations. Our project page is available at https://cvsp-lab.github.io/AgilePruner.
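The two diagnostic measures named in the abstract have standard definitions that can be sketched briefly. The following is a minimal illustration, not the paper's exact implementation: effective rank is commonly defined as the exponential of the Shannon entropy of the normalized singular-value distribution, and attention score entropy as the Shannon entropy of an attention distribution over visual tokens.

```python
import numpy as np

def effective_rank(features: np.ndarray) -> float:
    """Effective rank (erank) of a (tokens x dim) feature matrix:
    exp of the Shannon entropy of the normalized singular values.
    Higher erank = more feature diversity among the kept tokens."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop zero singular values to avoid log(0)
    return float(np.exp(-np.sum(p * np.log(p))))

def attention_entropy(attn: np.ndarray) -> float:
    """Shannon entropy of an attention distribution over visual tokens.
    Low entropy = attention concentrated on few tokens (simple image);
    high entropy = attention spread across many tokens (complex image)."""
    p = attn / attn.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))
```

For intuition, a matrix with n equally strong orthogonal directions has erank n, and a one-hot attention vector has entropy 0 while a uniform one over n tokens has entropy log n.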

Key Takeaways

  1. Diversity-based pruning methods preserve less actual diversity than intended and correlate with higher hallucination rates.

  2. Attention-based pruning excels on simple images, while diversity-based methods perform better on complex images with distributed features.

  3. Image-aware adaptive pruning that adjusts its strategy based on image complexity achieves strong performance across benchmarks.
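The third takeaway suggests routing between the two pruning families by image complexity. A minimal sketch of such a router, assuming attention entropy is used as the complexity signal and the threshold is a tunable hyperparameter (both are illustrative assumptions, not the paper's stated mechanism):

```python
import numpy as np

def select_pruning_strategy(attn: np.ndarray, entropy_threshold: float = 3.0) -> str:
    """Hypothetical image-aware router: if attention over visual tokens is
    concentrated (low entropy, simple image), prefer attention-based pruning;
    if it is distributed (high entropy, complex image), prefer diversity-based
    pruning. The threshold would be tuned on a validation set in practice."""
    p = attn / attn.sum()
    p = p[p > 0]
    entropy = -np.sum(p * np.log(p))
    return "attention" if entropy < entropy_threshold else "diversity"
```

A one-hot attention map (entropy 0) routes to attention-based pruning, while a near-uniform map over hundreds of tokens (entropy above log of the token count's threshold) routes to diversity-based pruning.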

Limitations

  • Analysis focuses primarily on image captioning tasks; generalization to other vision-language tasks unclear.

  • Minimal adaptive instantiation may lack sophistication compared to more complex hybrid pruning approaches.

Keywords

visual token pruning, large vision-language models, effective rank, attention score entropy, feature diversity, hallucination frequency, hybrid pruning strategies, adaptive pruning mechanism
