Multimodal AI

Causal-JEPA: Learning World Models through Object-Level Latent Interventions

Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, Randall Balestriero
Published: February 11, 2026
Authors: 5
Word Count: 13,662

Causal-JEPA learns world models by masking objects to force interaction reasoning in latent space.

Abstract

World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By applying object-level masking that requires an object's state to be inferred from other objects, C-JEPA induces latent interventions with counterfactual-like effects and prevents shortcut solutions, making interaction reasoning essential. Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning compared to the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object-level masking induces a causal inductive bias via latent interventions. Our code is available at https://github.com/galilai-group/cjepa.
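To illustrate the core idea, here is a minimal sketch of object-level masking over object slot latents. All shapes, names, and the zero-vector mask token are illustrative assumptions, not the paper's implementation: C-JEPA's actual encoder, predictor, and mask embedding are defined in the linked repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def object_level_mask(slots, mask_idx):
    """Replace one object's latent with a mask token (here a zero vector,
    standing in for a learned [MASK] embedding). A predictor trained to
    recover the masked slot must then use the *other* objects' latents,
    which is what makes interaction reasoning functionally necessary."""
    masked = slots.copy()
    masked[mask_idx] = 0.0  # assumption: zero vector as the mask token
    return masked

# Toy scene: 4 object slots, each an 8-dim latent (hypothetical sizes).
slots = rng.normal(size=(4, 8))

# Masking object 2 acts as a latent intervention: no shortcut from
# object 2's own features survives, so its state must be inferred
# from its interactions with objects 0, 1, and 3.
masked_slots = object_level_mask(slots, mask_idx=2)

print(bool(np.allclose(masked_slots[2], 0.0)))                       # True
print(bool(np.allclose(masked_slots[[0, 1, 3]], slots[[0, 1, 3]])))  # True
```

In the full method, the masked sequence is fed to a predictor whose output for the masked slot is regressed (in latent space) against a target encoder's embedding of the unmasked object, in the spirit of JEPA-style training.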

Key Takeaways

  1. Causal-JEPA forces models to learn object interactions by masking objects during training, preventing shortcut learning.

  2. Object-level masking makes interaction reasoning functionally necessary, unlike previous approaches that only incentivized it.

  3. The method solves world modeling without requiring explicit causal graphs or task-specific architectural constraints.

Limitations

  • This summary does not explain how the frozen encoder produces object-centric representations, nor other implementation details.

  • No experimental results or performance comparisons with other world-modeling approaches are included in this summary.

Keywords

object-centric representations, masked joint embedding prediction, counterfactual reasoning, latent interventions, causal inductive bias, visual question answering, agent control tasks
