RigMo: Unifying Rig and Motion Learning for Generative Animation

Hao Zhang, Jiahao Luo, Bohui Wan, Yizhou Zhao, Zongrui Li, Michael Vasilkovsky, Chaoyang Wang, Jian Wang, Narendra Ahuja, Bing Zhou

arXiv ID: 2601.06378
Published: January 10, 2026
Authors: 10

Abstract

Despite significant progress in 4D generation, rig and motion, the core structural and dynamic components of animation, are typically modeled as separate problems. Existing pipelines rely on ground-truth skeletons and skinning weights for motion generation and treat auto-rigging as an independent process, undermining scalability and interpretability. We present RigMo, a unified generative framework that jointly learns rig and motion directly from raw mesh sequences, without any human-provided rig annotations. RigMo encodes per-vertex deformations into two compact latent spaces: a rig latent that decodes into explicit Gaussian bones and skinning weights, and a motion latent that produces time-varying SE(3) transformations. Together, these outputs define an animatable mesh with explicit structure and coherent motion, enabling feed-forward rig and motion inference for deformable objects. Beyond unified rig-motion discovery, we introduce a Motion-DiT model operating in RigMo's latent space and demonstrate that these structure-aware latents can naturally support downstream motion generation tasks. Experiments on DeformingThings4D, Objaverse-XL, and TrueBones demonstrate that RigMo learns smooth, interpretable, and physically plausible rigs, while achieving superior reconstruction and category-level generalization compared to existing auto-rigging and deformation baselines. RigMo establishes a new paradigm for unified, structure-aware, and scalable dynamic 3D modeling.
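
To make concrete how bones, skinning weights, and per-bone SE(3) transformations can combine into a posed mesh, here is a minimal sketch for a single frame. It is not RigMo's implementation: the abstract does not state the deformation formula, so the sketch assumes standard linear blend skinning and assumes that skinning weights come from normalized isotropic Gaussians centered on each bone. All function names, the `sigma` parameter, and the toy data are hypothetical.

```python
import math

def gaussian_skinning_weights(vertices, centers, sigma):
    """Assumed weight model: normalized isotropic Gaussian per bone center."""
    weights = []
    for v in vertices:
        scores = [
            math.exp(-sum((a - b) ** 2 for a, b in zip(v, c)) / (2 * sigma ** 2))
            for c in centers
        ]
        total = sum(scores)
        weights.append([s / total for s in scores])  # each row sums to 1
    return weights

def apply_rigid(R, t, p):
    """Apply an SE(3) transform (3x3 rotation R, translation t) to point p."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

def linear_blend_skinning(vertices, weights, transforms):
    """Pose each vertex as the weight-blended result of all bone transforms."""
    posed = []
    for v, w in zip(vertices, weights):
        blended = [0.0, 0.0, 0.0]
        for w_k, (R, t) in zip(w, transforms):
            moved = apply_rigid(R, t, v)
            blended = [b + w_k * m for b, m in zip(blended, moved)]
        posed.append(blended)
    return posed

def rot_z(theta):
    """Rotation about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

# Toy data: three vertices on a line, two bones at the ends.
vertices = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
centers = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
weights = gaussian_skinning_weights(vertices, centers, sigma=0.5)

# One frame of motion: bone 0 stays put, bone 1 translates up by one unit.
identity = rot_z(0.0)
transforms = [(identity, [0.0, 0.0, 0.0]), (identity, [0.0, 1.0, 0.0])]
posed = linear_blend_skinning(vertices, weights, transforms)
```

In this toy setup the middle vertex is equidistant from both bones, so it receives equal weights and moves half as far as the second bone. A time-varying motion would supply one list of `transforms` per frame and call `linear_blend_skinning` once for each.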

Keywords

generative framework, rig latent, motion latent, SE(3) transformations, auto-rigging, deformation baselines, latent space, mesh sequences, animatable mesh, structure-aware latents
