
Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin, Lang Nie, Zhenlong Yuan, Xiangxiang Chu, Yunchao Wei, Kang Liao, Guosheng Lin

Published: March 3, 2026
Authors: 11
Word Count: 6,882

RL-based 3D scene editing achieves multi-view consistency using frozen foundation models as reward functions.

Abstract

Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data: we feed it the edited images and use its output confidence maps and pose-estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
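The reward design described above can be sketched as a simple scalar: reward high mean confidence from the 3D model's confidence maps and penalize pose-estimation error across the edited views. The function below is a hypothetical illustration, not the paper's implementation; the name `consistency_reward` and the weight `lam` are assumptions, and the VGGT inference step is replaced by precomputed toy arrays.

```python
import numpy as np

def consistency_reward(conf_maps, pose_errors, lam=1.0):
    """Hypothetical sketch of a VGGT-style consistency reward.

    conf_maps:   (V, H, W) per-view confidence maps in [0, 1],
                 as a 3D foundation model might output
    pose_errors: (V,) per-view pose-estimation errors
    lam:         assumed weight trading confidence against pose error
    """
    conf_term = float(np.mean(conf_maps))    # higher when edits look 3D-consistent
    pose_term = float(np.mean(pose_errors))  # higher when camera poses drift
    return conf_term - lam * pose_term

# Toy example: 4 views with 8x8 confidence maps (stand-ins for model outputs)
conf = np.full((4, 8, 8), 0.9)
errs = np.array([0.05, 0.04, 0.06, 0.05])
reward = consistency_reward(conf, errs, lam=1.0)
```

In an RL loop, a reward of this shape would be computed per edited multi-view batch and used to update the 2D editor's policy, so that edits which lower VGGT's confidence or inflate its pose errors are discouraged.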

Key Takeaways

  1. RL3DEdit uses reinforcement learning with a 3D foundation model as the reward function to achieve multi-view consistent 3D scene editing without paired training data.

  2. The method leverages VGGT's confidence maps as a proxy for multi-view consistency, validated through empirical analysis showing that confidence drops under inconsistent edits.

  3. RL3DEdit enables single-pass inference that is over 2× faster than previous methods while handling geometry-changing edits such as object addition and character pose changes.

Limitations

  • Method requires base editors with multi-image joint editing capabilities, limiting compatibility to specific 2D editing models like FLUX-Kontext.

  • Approach relies on VGGT foundation model availability and may not generalize to domains significantly different from VGGT's training data distribution.

Keywords

diffusion models, reinforcement learning, 3D editing, multi-view consistency, supervised fine-tuning, VGGT, reward signals, 3D foundation model
