
STEP3-VL-10B Technical Report

Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, Jingcheng Hu, Kangheng Lin, Liang Zhao, Mitt Huang, Song Yuan, Wenwen Qu, Xiangfeng Wang, Yanlin Lai, Yingxiu Zhao, Yinmin Zhang, Yukang Shi, Yuyang Chen, Zejia Weng, Ziyang Meng, Ang Li, Aobo Kong, Bo Dong, Changyi Wan, David Wang, Di Qi, Dingming Li, En Yu, Guopeng Li, Haiquan Yin, Han Zhou, Hanshan Zhang, Haolong Yan, Hebin Zhou, Hongbo Peng, Jiaran Zhang, Jiashu Lv, Jiayi Fu, Jie Cheng, Jie Zhou, Jisheng Yin, Jingjing Xie, Jingwei Wu, Jun Zhang, Junfeng Liu, Kaijun Tan, Kaiwen Yan, Liangyu Chen, Lina Chen, Mingliang Li, Qian Zhao, Quan Sun, Shaoliang Pang, Shengjie Fan, Shijie Shang, Siyuan Zhang, Tianhao You, Wei Ji, Wuxun Xie, Xiaobo Yang, Xiaojie Hou, Xiaoran Jiao, Xiaoxiao Ren, Xiangwen Kong, Xin Huang, Xin Wu, Xing Chen, Xinran Wang, Xuelin Zhang, Yana Wei, Yang Li, Yanming Xu, Yeqing Shen, Yuang Peng, Yue Peng, Yu Zhou, Yusheng Li, Yuxiang Yang, Yuyang Zhang, Zhe Xie, Zhewei Huang, Zhenyi Lu, Zhimin Fan, Zihui Cheng, Daxin Jiang, Qi Han, Xiangyu Zhang, Yibo Zhu, Zheng Ge
arXiv ID: 2601.09668
Published: January 14, 2026
Authors: 93
Hugging Face Likes: 169
Comments: 6

Abstract

We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10x to 20x larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
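The abstract describes PaCoRe only at a high level (explore diverse hypotheses in parallel, then synthesize them at test time), so the following is a minimal sketch of that generic explore-then-synthesize pattern, not the paper's actual implementation. The `ask_model` callable, the prompt wording, and the `parallel_coordinated_reasoning` helper are all hypothetical placeholders assumed for illustration.

```python
# Minimal sketch of parallel sample-and-synthesize test-time compute scaling,
# in the spirit of PaCoRe as summarized in the abstract. All names below are
# illustrative assumptions, not the released STEP3-VL-10B API.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def parallel_coordinated_reasoning(
    ask_model: Callable[[str], str],   # hypothetical: one call returns one reasoning trace
    question: str,
    num_hypotheses: int = 8,
) -> str:
    """Explore several independent reasoning hypotheses in parallel,
    then ask the model to reconcile them into a single final answer."""
    # 1) Explore: sample diverse hypotheses concurrently
    #    (non-zero sampling temperature assumed inside ask_model).
    explore_prompt = f"Reason step by step about the image and question:\n{question}"
    with ThreadPoolExecutor(max_workers=num_hypotheses) as pool:
        hypotheses: List[str] = list(
            pool.map(ask_model, [explore_prompt] * num_hypotheses)
        )

    # 2) Synthesize: a second pass reads all candidate traces and
    #    produces one consolidated answer.
    joined = "\n\n".join(f"Hypothesis {i + 1}:\n{h}" for i, h in enumerate(hypotheses))
    synth_prompt = (
        "Several independent reasoning attempts are given below. "
        "Compare them, resolve disagreements, and give one final answer.\n\n"
        f"{joined}\n\nQuestion: {question}\nFinal answer:"
    )
    return ask_model(synth_prompt)
```

Under these assumptions, extra test-time compute is spent on the exploration stage (more parallel hypotheses) rather than on a single longer chain, with the synthesis call acting as the coordination step.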
