VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, Zhibo Chen
Published: February 10, 2026
Authors: 9
Word Count: 10,109

VLA-JEPA teaches robots from video by predicting latent world states, not pixels.

Abstract

Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is leakage-free state prediction: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation -- future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe -- JEPA pretraining followed by action-head fine-tuning -- without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.
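
The abstract's core objective reduces to a compact training loop. Below is a minimal PyTorch sketch of the leakage-free latent prediction idea; the module names, the EMA decay value, and the smooth-L1 loss are our own illustrative assumptions, not the paper's actual components. The student encodes only the current frame, a predictor maps that latent forward in time, and a frozen EMA copy of the encoder turns the future frame into a supervision target that never enters the student pathway.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentStatePredictor(nn.Module):
    """Minimal JEPA-style objective: predict the latent of a future frame
    from the current observation alone. Module names, the EMA decay, and
    the loss choice are illustrative assumptions, not the paper's exact
    components. The encoder is assumed to map a frame batch to [B, D]."""

    def __init__(self, encoder: nn.Module, embed_dim: int = 768, ema_decay: float = 0.996):
        super().__init__()
        self.student = encoder                        # sees the current frame only
        self.target = copy.deepcopy(encoder)          # sees the future frame
        for p in self.target.parameters():
            p.requires_grad = False                   # future info is supervision only
        self.predictor = nn.Sequential(               # maps current latent -> future latent
            nn.Linear(embed_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )
        self.ema_decay = ema_decay

    @torch.no_grad()
    def update_target(self):
        # Keep the target encoder a slow-moving (EMA) copy of the student.
        for ps, pt in zip(self.student.parameters(), self.target.parameters()):
            pt.mul_(self.ema_decay).add_(ps, alpha=1.0 - self.ema_decay)

    def forward(self, frame_t: torch.Tensor, frame_future: torch.Tensor) -> torch.Tensor:
        z_t = self.student(frame_t)                   # student pathway: current frame only
        with torch.no_grad():
            z_future = self.target(frame_future)      # leakage-free: target, never input
        # The loss lives in latent space, so pixel-level nuisance variation
        # (camera jitter, background texture) need not be reconstructed.
        return F.smooth_l1_loss(self.predictor(z_t), z_future)
```

Because the loss is computed between latents rather than pixels, the encoder is free to discard camera motion and background changes that a pixel-reconstruction objective would be forced to model.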

Key Takeaways

  1. VLA-JEPA sidesteps pixel-prediction bias by predicting in latent space rather than pixel space.

  2. Leakage-free state prediction prevents information shortcuts by hiding future frames from the student pathway.

  3. The framework leverages internet-scale video data to pretrain robot policies without expensive robot-collected datasets; a fine-tuning sketch follows this list.
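
The abstract's two-stage recipe ends with action-head fine-tuning. The sketch below shows what that stage could look like under simple assumptions: `ActionHeadPolicy`, the 7-dimensional action space, and the behavior-cloning loss are hypothetical choices for illustration, not the paper's reported setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionHeadPolicy(nn.Module):
    """Stage 2: attach a small action head to the JEPA-pretrained student
    encoder. The head design, embedding size, and 7-D action space are
    hypothetical choices for illustration."""

    def __init__(self, encoder: nn.Module, embed_dim: int = 768, action_dim: int = 7):
        super().__init__()
        self.encoder = encoder                        # pretrained student encoder
        self.action_head = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.GELU(),
            nn.Linear(256, action_dim),               # e.g. end-effector delta + gripper
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.action_head(self.encoder(frame))

def finetune_step(policy, optimizer, frames, expert_actions):
    # One behavior-cloning step on robot demonstrations (assumed setup).
    optimizer.zero_grad()
    loss = F.mse_loss(policy(frames), expert_actions)
    loss.backward()
    optimizer.step()
    return loss.item()
```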

Limitations

  • Current latent-action approaches suffer from pixel-level supervision that biases models toward appearance over control.

  • Real-world videos contain camera motion and background changes that overwhelm actual interaction signals.

Keywords

Vision-Language-Action, JEPA, latent-action objectives, pixel variation, action-relevant state transitions, appearance bias, nuisance motion, information leakage, target encoder, student pathway, latent representations, future frames, current observation, latent space, dynamics abstractions, camera motion, background changes, JEPA pretraining, action-head fine-tuning, generalization, robustness
