MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang, Huichao Wang, Jiale Chen, Jianfei Pan, Jieqiong Cao, Jinghao Lin, Kai Wu, Lin Yang, Shengsheng Yao, Tao Chen, Xiaojun Xiao, Xiaozhong Ji, Xu Wang, Yijun He, Zhixiong Yang
Published: February 13, 2026
Abstract

We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.
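The abstract's entity-aware organization of heterogeneous corpora to reduce long-tail gaps can be illustrated with a sampling sketch. This is a minimal, hypothetical example (not the paper's actual pipeline), assuming each document is tagged with a single medical entity: documents about rare entities are upweighted by tempering the entity frequency distribution with an exponent `alpha < 1`.

```python
import random
from collections import Counter

def entity_balanced_sample(docs, num_samples, alpha=0.5, seed=0):
    """Sample training docs with rare-entity upweighting.

    docs: list of (doc_id, entity) pairs; each doc is tagged with one
    medical entity (a simplification -- real corpora are multi-label).
    alpha < 1 flattens the entity frequency distribution, so documents
    about rare entities (e.g. rare diseases) are drawn more often than
    their raw corpus share would suggest.
    """
    counts = Counter(entity for _, entity in docs)
    # Weight each doc inversely by its entity's frequency, tempered by alpha:
    # weight = count^(alpha - 1), so an entity's total sampling mass ~ count^alpha.
    weights = [counts[entity] ** (alpha - 1.0) for _, entity in docs]
    rng = random.Random(seed)
    return rng.choices(docs, weights=weights, k=num_samples)

# Toy corpus: a common entity (80% of docs) and a rare one (20%).
corpus = [("d1", "pneumonia")] * 8 + [("d2", "fabry_disease")] * 2
sample = entity_balanced_sample(corpus, num_samples=1000)
rare = sum(1 for _, e in sample if e == "fabry_disease")
# With alpha=0.5 the rare entity's sampled share rises toward ~1/3,
# above its 20% corpus share.
print(rare / 1000)
```

With `alpha=1.0` this reduces to plain frequency-proportional sampling, and `alpha=0.0` would equalize all entities; intermediate values trade off coverage of rare entities against fidelity to the natural distribution.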

Keywords

vision-language foundation model, entity-aware continual pretraining, heterogeneous medical corpora, long-tail gaps, reinforcement learning, tool-augmented agentic training, multi-step diagnostic reasoning, evidence-grounded reasoning, hallucination reduction, medical instruction adherence
