Multimodal AI

LaViDa-R1: Advancing Reasoning for Unified Multimodal Diffusion Language Models

Shufan Li, Yuchen Zhu, Jiuxiang Gu, Kangning Liu, Zhe Lin, Yongxin Chen, Molei Tao, Aditya Grover, Jason Kuen
Published: February 15, 2026
Authors: 9
Word Count: 14,854
Code: Includes code

LaViDa-R1 enables stable reasoning in multimodal diffusion models through unified training and guided generation.

Abstract

Diffusion language models (dLLMs) have recently emerged as a promising alternative to auto-regressive LLMs, and the latest works have further extended them to multimodal understanding and generation tasks. In this work, we propose LaViDa-R1, a multimodal, general-purpose reasoning dLLM. Unlike existing works that build reasoning dLLMs through task-specific reinforcement learning, LaViDa-R1 incorporates diverse multimodal understanding and generation tasks in a unified manner. In particular, LaViDa-R1 is built with a novel unified post-training framework that seamlessly integrates supervised finetuning (SFT) and multi-task reinforcement learning (RL). It employs several novel training techniques, including answer-forcing, tree search, and complementary likelihood estimation, to enhance effectiveness and scalability. Extensive experiments demonstrate LaViDa-R1's strong performance on a wide range of multimodal tasks, including visual math reasoning, reason-intensive grounding, and image editing.

Key Takeaways

  1. LaViDa-R1 enables reasoning in diffusion language models by unifying supervised finetuning, reinforcement learning, and self-distillation into one framework.

  2. Replacing KL divergence regularization with supervised finetuning loss prevents training collapse in diffusion models while maintaining exploration.

  3. Guided rollout generation solves the vanishing training signal problem when diffusion models fail to generate quality samples for difficult prompts.
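The second takeaway can be sketched in a few lines: instead of penalizing divergence from a reference policy with a KL term, the RL objective is regularized by an ordinary SFT cross-entropy loss on ground-truth data. The sketch below is a minimal illustration under assumed interfaces (the function name, the scalar-advantage REINFORCE form, and the `alpha` weighting are hypothetical, not the paper's implementation):

```python
def combined_loss(logp_rollout, advantage, logp_sft, alpha=0.1):
    """Hypothetical RL + SFT objective (no KL penalty).

    logp_rollout: per-token log-probs of a sampled rollout under the policy
    advantage:    scalar advantage (reward baseline-subtracted) of the rollout
    logp_sft:     per-token log-probs of ground-truth SFT tokens under the policy
    alpha:        assumed weight on the SFT regularizer
    """
    # REINFORCE-style policy-gradient term on the rollout
    pg_loss = -advantage * sum(logp_rollout) / len(logp_rollout)
    # Cross-entropy on supervised data, standing in for a KL penalty
    sft_loss = -sum(logp_sft) / len(logp_sft)
    return pg_loss + alpha * sft_loss
```

The SFT term anchors the policy to high-quality data without requiring likelihoods from a frozen reference model, which, per the limitations above, is where the high-variance KL estimates arise for diffusion models.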

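The third takeaway, guided rollout generation, can likewise be sketched: when every sampled rollout for a hard prompt fails, one rollout is seeded with the known answer so the reward signal does not vanish. All names below (`sample_fn`, `reward_fn`, `forced_suffix`) are assumed interfaces for illustration, not the paper's API:

```python
def guided_rollouts(sample_fn, prompt, answer, n, reward_fn):
    """Hypothetical guided rollout generation.

    If none of the n ordinary rollouts earns a positive reward, replace one
    with an answer-forced decode so at least one training signal survives.
    """
    rollouts = [sample_fn(prompt) for _ in range(n)]
    rewards = [reward_fn(r) for r in rollouts]
    if max(rewards) == 0:  # vanishing signal: every rollout failed
        guided = sample_fn(prompt, forced_suffix=answer)  # answer-forced decode
        rollouts[-1] = guided
        rewards[-1] = reward_fn(guided)
    return rollouts, rewards
```

The design choice here is that guidance is applied only on failure, so easy prompts still train on the model's own exploration.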
Limitations

  • Existing reasoning approaches for diffusion models focus on task-specific training rather than unified multi-task learning frameworks.

  • KL divergence regularization causes high variance in diffusion models due to random token sampling in high-entropy image distributions.

Keywords

diffusion language models, multimodal understanding, multimodal generation, unified post-training framework, supervised fine-tuning, multi-task reinforcement learning, answer-forcing, tree search, complementary likelihood estimation, visual math reasoning, reason-intensive grounding, image editing
