Multimodal AI

LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models

Shufan Li, Yuchen Zhu, Jiuxiang Gu, Kangning Liu, Zhe Lin, Yongxin Chen, Molei Tao, Aditya Grover, Jason Kuen
Published: February 15, 2026
Authors: 9
Word Count: 14,854
Code: Includes code

LaViDa-R1 enables stable reasoning in multimodal diffusion models through unified training and guided generation.

Abstract

Diffusion language models (dLLMs) have recently emerged as a promising alternative to auto-regressive LLMs, and the latest works have further extended them to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised finetuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer-forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1's strong performance on a wide range of multimodal tasks, including visual math reasoning, reason-intensive grounding, and image editing.

Key Takeaways

  1. LaViDa-R1 enables reasoning in diffusion language models by unifying supervised finetuning, reinforcement learning, and self-distillation into one framework.

  2. Replacing KL divergence regularization with supervised finetuning loss prevents training collapse in diffusion models while maintaining exploration.

  3. Guided rollout generation solves the vanishing training signal problem when diffusion models fail to generate quality samples for difficult prompts.
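The second takeaway can be sketched in a few lines: instead of penalizing divergence from a reference policy with a KL term, the RL objective is regularized by an ordinary SFT cross-entropy loss on ground-truth data. The sketch below is a minimal illustration under assumed interfaces (the function name, the scalar-advantage REINFORCE form, and the `alpha` weighting are hypothetical, not the paper's implementation):

```python
def combined_loss(logp_rollout, advantage, logp_sft, alpha=0.1):
    """Hypothetical RL + SFT objective (no KL penalty).

    logp_rollout: per-token log-probs of a sampled rollout under the policy
    advantage:    scalar advantage (reward baseline-subtracted) of the rollout
    logp_sft:     per-token log-probs of ground-truth SFT tokens under the policy
    alpha:        assumed weight on the SFT regularizer
    """
    # REINFORCE-style policy-gradient term on the rollout
    pg_loss = -advantage * sum(logp_rollout) / len(logp_rollout)
    # Cross-entropy on supervised data, standing in for a KL penalty
    sft_loss = -sum(logp_sft) / len(logp_sft)
    return pg_loss + alpha * sft_loss
```

The SFT term anchors the policy to high-quality data without requiring likelihoods from a frozen reference model, which, per the limitations above, is where the high-variance KL estimates arise for diffusion models.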

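The third takeaway, guided rollout generation, can likewise be sketched: when every sampled rollout for a hard prompt fails, one rollout is seeded with the known answer so the reward signal does not vanish. All names below (`sample_fn`, `reward_fn`, `forced_suffix`) are assumed interfaces for illustration, not the paper's API:

```python
def guided_rollouts(sample_fn, prompt, answer, n, reward_fn):
    """Hypothetical guided rollout generation.

    If none of the n ordinary rollouts earns a positive reward, replace one
    with an answer-forced decode so at least one training signal survives.
    """
    rollouts = [sample_fn(prompt) for _ in range(n)]
    rewards = [reward_fn(r) for r in rollouts]
    if max(rewards) == 0:  # vanishing signal: every rollout failed
        guided = sample_fn(prompt, forced_suffix=answer)  # answer-forced decode
        rollouts[-1] = guided
        rewards[-1] = reward_fn(guided)
    return rollouts, rewards
```

The design choice here is that guidance is applied only on failure, so easy prompts still train on the model's own exploration.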
Limitations

  • Existing reasoning approaches for diffusion models focus on task-specific training rather than unified multi-task learning frameworks.

  • KL divergence regularization causes high variance in diffusion models due to random token sampling in high-entropy image distributions.

Keywords

diffusion language models, multimodal understanding, multimodal generation, unified post-training framework, supervised fine-tuning, multi-task reinforcement learning, answer-forcing, tree search, complementary likelihood estimation, visual math reasoning, reason-intensive grounding, image editing
