
Beyond Language Modeling: An Exploration of Multimodal Pretraining

Shengbang Tong, David Fan, John Nguyen, Ellis Brown, Gaoyue Zhou, Shengyi Qian, Boyang Zheng, Théophane Vallaeys, Junlin Han, Rob Fergus, Naila Murray, Marjan Ghazvininejad, Mike Lewis, Nicolas Ballas, Amir Bar, Michael Rabbat, Jakob Verbeek, Luke Zettlemoyer, Koustuv Sinha, Yann LeCun, Saining Xie
Published: March 3, 2026
Authors: 21
Word Count: 13,426

Unified multimodal pretraining reveals that vision and language naturally synergize rather than compete for model capacity.

Abstract

The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
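
To make the setup concrete, below is a minimal sketch of a Transfusion-style training step: next-token cross-entropy on text positions combined with a diffusion (noise-prediction) loss on visual latents, such as RAE latents. The model interface, the noise schedule, and the loss_weight balance are illustrative assumptions rather than the paper's exact implementation.

    import torch
    import torch.nn.functional as F

    def transfusion_style_loss(model, text_tokens, visual_latents, loss_weight=1.0):
        # Hypothetical interface: `model` consumes the interleaved inputs and returns
        # text logits plus a noise prediction for the noised visual latents.
        noise = torch.randn_like(visual_latents)
        t = torch.rand(visual_latents.shape[0], device=visual_latents.device)
        alpha = (1.0 - t).view(-1, *([1] * (visual_latents.dim() - 1)))
        noised_latents = alpha.sqrt() * visual_latents + (1.0 - alpha).sqrt() * noise

        text_logits, noise_pred = model(text_tokens[:, :-1], noised_latents, t)

        # Next-token prediction (cross-entropy) on language positions.
        lm_loss = F.cross_entropy(
            text_logits.reshape(-1, text_logits.size(-1)),
            text_tokens[:, 1:].reshape(-1),
        )
        # Denoising objective (noise prediction) on visual latents.
        diffusion_loss = F.mse_loss(noise_pred, noise)
        return lm_loss + loss_weight * diffusion_loss

In the Transfusion framework, text and visual tokens share a single transformer and only the per-position loss differs by modality, which is what allows the synergy and scaling comparisons described above to be made within one model.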

Key Takeaways

  1. Representation Autoencoders outperform VAEs for both visual understanding and generation tasks in unified multimodal models.

  2. Vision and language data are complementary, showing positive synergy and contradicting the assumption that the two modalities compete for model capacity.

  3. Mixture-of-Experts architectures efficiently accommodate vision's greater appetite for data relative to language during multimodal scaling (see the routing sketch after this list).
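
As a rough illustration of how a Mixture-of-Experts layer can induce modality specialization, the sketch below routes each token to its top-scoring experts; logging which experts text versus image tokens select is one simple way to observe specialization. The expert count, top-k value, and class names are illustrative assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    class TopKMoE(nn.Module):
        # Minimal top-k routed Mixture-of-Experts feed-forward layer (illustrative only).
        def __init__(self, dim, num_experts=8, top_k=2):
            super().__init__()
            self.router = nn.Linear(dim, num_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                for _ in range(num_experts)
            )
            self.top_k = top_k

        def forward(self, x):
            # x: (num_tokens, dim), with text and visual tokens interleaved.
            gate_probs = self.router(x).softmax(dim=-1)
            weights, expert_ids = gate_probs.topk(self.top_k, dim=-1)
            weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = expert_ids[:, k] == e
                    if mask.any():
                        out[mask] += weights[mask, k:k + 1] * expert(x[mask])
            # expert_ids can be aggregated per modality to measure specialization.
            return out, expert_ids

In a unified model, the same routed layer serves both modalities; the abstract's claim is that such layers naturally induce modality specialization during pretraining while providing the extra capacity language needs.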

Limitations

  • The study is limited to from-scratch pretraining and does not evaluate instruction tuning on downstream tasks.

  • The scaling asymmetry between vision and language remains a fundamental challenge even with MoE-based solutions.

Keywords

multimodal models, Transfusion framework, next-token prediction, diffusion, Representation Autoencoder, visual understanding, visual generation, world modeling, Mixture-of-Experts, scaling laws, IsoFLOP analysis, data complementarity, modality specialization
