Computer Vision

UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders

Matthew Walmer, Saksham Suri, Anirud Aggarwal, Abhinav Shrivastava
Published: January 25, 2026
Authors: 4
Word Count: 12,448

Efficient, high-resolution feature upsampling with UPLiFT.

Abstract

The space of task-agnostic feature upsampling has emerged as a promising area of research to efficiently create denser features from pre-trained visual backbones. These methods act as a shortcut to achieve dense features for a fraction of the cost by learning to map low-resolution features to high-resolution versions. While early works in this space used iterative upsampling approaches, more recent works have switched to cross-attention-based methods, which risk falling into the same efficiency scaling problems as the backbones they are upsampling. In this work, we demonstrate that iterative upsampling methods can still compete with cross-attention-based methods; moreover, they can achieve state-of-the-art performance with lower inference costs. We propose UPLiFT, an architecture for Universal Pixel-dense Lightweight Feature Transforms. We also propose an efficient Local Attender operator to overcome the limitations of prior iterative feature upsampling methods. This operator uses an alternative attentional pooling formulation defined fully locally. We show that our Local Attender allows UPLiFT to maintain stable features throughout upsampling, enabling state-of-the-art performance with lower inference costs than existing pixel-dense feature upsamplers. In addition, we apply UPLiFT to generative downstream tasks and show that it achieves competitive performance with state-of-the-art Coupled Flow Matching models for VAE feature upsampling. Altogether, UPLiFT offers a versatile and efficient approach to creating denser features.
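
The abstract describes the Local Attender only as attentional pooling "defined fully locally" and gives no formula. The sketch below is one possible reading of that idea, not the authors' operator: each upsampled position takes a softmax over a small fixed neighborhood of low-resolution features and returns their weighted sum. The 3x3 neighborhood, the 2x scale, the border clamping, and the names OFFSETS and local_attend_upsample are all assumptions made for illustration. The attention logits are passed in directly here, whereas a real model would presumably predict them.

```python
import math

# Hypothetical 3x3 neighborhood of (row, col) offsets; the abstract does not specify one.
OFFSETS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def local_attend_upsample(feats, logits, scale=2):
    """Upsample feats (H x W x C nested lists) by local attentional pooling.

    logits has shape (H*scale) x (W*scale) x len(OFFSETS): one score per
    neighborhood offset for every output position (an illustrative assumption).
    """
    h, w, c = len(feats), len(feats[0]), len(feats[0][0])
    out = []
    for i in range(h * scale):
        row = []
        for j in range(w * scale):
            weights = softmax(logits[i][j])
            # Low-resolution cell that this output position falls inside.
            pi, pj = i // scale, j // scale
            pooled = [0.0] * c
            for wgt, (dr, dc) in zip(weights, OFFSETS):
                # Clamp neighbors at the border instead of padding.
                r = min(max(pi + dr, 0), h - 1)
                q = min(max(pj + dc, 0), w - 1)
                for k in range(c):
                    pooled[k] += wgt * feats[r][q][k]
            row.append(pooled)
        out.append(row)
    return out


if __name__ == "__main__":
    feats = [[[1.0], [2.0]], [[3.0], [4.0]]]  # toy 2 x 2 map with 1 channel
    uniform = [[[0.0] * len(OFFSETS) for _ in range(4)] for _ in range(4)]
    up = local_attend_upsample(feats, uniform)
    print(len(up), len(up[0]), up[0][0])
```

In this sketch the work per output position depends only on the neighborhood size, so total cost grows linearly with the number of output pixels, unlike attention over every low-resolution token. Each output is also a convex combination of existing features, so it stays within the range of its inputs; whether this is the mechanism behind the stability the abstract reports is not stated there.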

Key Takeaways

  1. UPLiFT uses a Local Attender operator for efficient feature upsampling.

  2. It achieves state-of-the-art performance with lower inference costs than existing pixel-dense feature upsamplers.

  3. It outperforms other methods in semantic segmentation and super-resolution.

Limitations

  • Performance depends on the quality of the backbone.

  • Requires significant computational resources for larger images.

Keywords

feature upsampling, visual backbones, cross-attention, iterative upsampling, pixel-dense feature upsamplers, UPLiFT, Local Attender, Coupled Flow Matching, VAE feature upsampling
