
Generation Enhances Understanding in Unified Multimodal Models via Multi-Representation Generation

Zihan Su, Hongyang Wei, Kangrui Cen, Yong Wang, Guanhua Chen, Chun Yuan, Xiangxiang Chu
Published: January 29, 2026

Abstract

Unified Multimodal Models (UMMs) integrate visual understanding and generation within a single framework. The ultimate aspiration for such models is a cycle in which understanding and generation mutually reinforce each other. While recent post-training methods have successfully leveraged understanding to enhance generation, the reverse direction, using generation to improve understanding, remains largely unexplored. In this work, we propose UniMRG (Unified Multi-Representation Generation), a simple yet effective, architecture-agnostic post-training method. UniMRG enhances the understanding capabilities of UMMs by incorporating auxiliary generation tasks: alongside standard visual understanding objectives, UMMs are trained to generate multiple intrinsic representations of input images, namely pixel (reconstruction), depth (geometry), and segmentation (structure). By synthesizing these diverse representations, UMMs capture complementary information about appearance, spatial relations, and structural layout, and thereby develop a deeper, more comprehensive understanding of visual inputs. Extensive experiments across diverse UMM architectures demonstrate that our method notably enhances fine-grained perception, reduces hallucinations, and improves spatial understanding, while simultaneously boosting generation capabilities.
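
The abstract does not include implementation details, so the following is only a minimal PyTorch sketch of one plausible form of the multi-task objective it describes: a standard understanding loss combined with weighted auxiliary generation losses for the pixel, depth, and segmentation representations. All names, loss choices, and weights here (unimrg_loss, w_pixel, and so on) are hypothetical illustrations, not taken from the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a UniMRG-style multi-task objective: one
# understanding loss plus auxiliary generation losses over intrinsic
# image representations. Loss functions and weights are illustrative.
def unimrg_loss(und_logits, und_labels,
                pixel_pred, pixel_target,
                depth_pred, depth_target,
                seg_logits, seg_labels,
                w_pixel=1.0, w_depth=0.5, w_seg=0.5):
    # Standard visual understanding objective (e.g. answer-token cross-entropy).
    l_und = F.cross_entropy(und_logits, und_labels)
    # Auxiliary generation objectives for the three intrinsic representations.
    l_pixel = F.mse_loss(pixel_pred, pixel_target)   # pixel: appearance
    l_depth = F.l1_loss(depth_pred, depth_target)    # depth: geometry
    l_seg = F.cross_entropy(seg_logits, seg_labels)  # segmentation: structure
    return l_und + w_pixel * l_pixel + w_depth * l_depth + w_seg * l_seg

# Toy usage with random tensors standing in for model outputs and targets.
B, V, C, H, W, K = 2, 100, 3, 16, 16, 5
loss = unimrg_loss(
    torch.randn(B, V, requires_grad=True), torch.randint(0, V, (B,)),
    torch.randn(B, C, H, W, requires_grad=True), torch.randn(B, C, H, W),
    torch.randn(B, 1, H, W, requires_grad=True), torch.randn(B, 1, H, W),
    torch.randn(B, K, H, W, requires_grad=True), torch.randint(0, K, (B, H, W)),
)
loss.backward()  # gradients flow jointly through all four objectives
print(float(loss))
```

In such a setup, a single backward pass through the combined loss would let the auxiliary generation targets shape the shared representations used for understanding, which is the direction of transfer the paper studies.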

Keywords

Unified Multimodal Models, post-training methods, visual understanding, visual generation, auxiliary generation tasks, intrinsic representations, pixel reconstruction, depth estimation, segmentation, fine-grained perception, hallucination reduction, spatial understanding
