Multimodal AI

SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

Hyeonbeom Choi, Daechul Ahn, Youhan Lee, Taewook Kang, Seongwon Cho, Jonghyun Choi
Published: February 4, 2026
Authors: 6
Word Count: 11,334
Code: Includes code

Adaptive VLA inference via self-uncertainty modulation.

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, and test-time scaling (TTS) is gaining attention as a way to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed, which is insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty', inspired by uncertainty-driven exploration in Active Inference theory. It requires no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty and focuses on exploitation when confident, enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.
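
The abstract does not give SCALE's formulas, so the following is only an illustrative sketch of the general idea of uncertainty-conditioned decoding, not the authors' method. It assumes, hypothetically, that self-uncertainty is measured as the normalized entropy of the model's action-token distribution and that this value linearly sets a sampling temperature. The function names and the bounds `t_min` and `t_max` are invented for illustration, and the sketch covers only the action side, not the perception side.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Shannon entropy divided by log(K), so the result lies in [0, 1]."""
    k = len(probs)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)

def adaptive_temperature(logits, t_min=0.1, t_max=1.5):
    """Map self-uncertainty to a temperature (assumed linear mapping).

    High uncertainty -> temperature near t_max (broader exploration).
    Low uncertainty  -> temperature near t_min (sharper exploitation).
    Returns (temperature, uncertainty).
    """
    u = normalized_entropy(softmax(logits))
    return t_min + u * (t_max - t_min), u

def adaptive_distribution(logits, t_min=0.1, t_max=1.5):
    """Re-scale the same logits with the uncertainty-derived temperature.

    Only one set of logits is needed, so no extra forward pass is implied.
    """
    temperature, _ = adaptive_temperature(logits, t_min, t_max)
    return softmax(logits, temperature)

# Usage: a confident prediction versus an ambiguous one.
confident_logits = [10.0, 0.0, 0.0, 0.0]
ambiguous_logits = [0.0, 0.0, 0.0, 0.0]
t_conf, u_conf = adaptive_temperature(confident_logits)
t_amb, u_amb = adaptive_temperature(ambiguous_logits)
```

In this sketch the confident case gets a temperature close to `t_min` and the ambiguous case gets `t_max`, which mirrors the exploit-when-confident, explore-when-uncertain behavior described above. A perception-side analogue would need details that the abstract does not provide.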

Key Takeaways

  • SCALE modulates perception and action based on self-uncertainty.

  • Enhances VLA robustness without additional training or multiple passes.

  • Adapts to perceptual ambiguity and action multimodality effectively.

Limitations

  • Requires a well-trained VLA model for its self-uncertainty to be a useful signal.

  • May not fully resolve all perceptual ambiguities in complex scenarios.

Keywords

Vision-Language-Action models, test-time scaling, active inference, self-uncertainty, visual perception, action decoding, perceptual ambiguity, uncertainty-driven exploration, single-pass inference, adaptive execution
