Efficient AI

SPARKLING: Balancing Signal Preservation and Symmetry Breaking for Width-Progressive Learning

Qifan Yu, Xinyu Ma, Zhijian Zhuo, Minrui Wang, Deyi Liu, Shiyi Zhan, Yiyuan Ma, Liang Xiang, Xingyan Bin, Di He
Published: February 2, 2026
Authors: 10
Word Count: 9,511

Efficient mid-stage width expansion for large models.

Abstract

Progressive Learning (PL) reduces pre-training computational overhead by gradually increasing model scale. While prior work has extensively explored depth expansion, width expansion remains significantly understudied, with the few existing methods limited to the early stages of training. However, expanding width during the mid-stage is essential for maximizing computational savings, yet it remains a formidable challenge due to severe training instabilities. Empirically, we show that naive initialization at this stage disrupts activation statistics, triggering loss spikes, while copy-based initialization introduces gradient symmetry that hinders feature diversity. To address these issues, we propose SPARKLING (balancing Signal Preservation And symmetRy breaKing for width-progressive LearnING), a novel framework for mid-stage width expansion. Our method achieves signal preservation via RMS-scale consistency, stabilizing activation statistics during expansion. Symmetry breaking is ensured through asymmetric optimizer state resetting and learning rate re-warmup. Extensive experiments on Mixture-of-Experts (MoE) models demonstrate that, across multiple width axes and optimizer families, SPARKLING consistently outperforms training from scratch and reduces training cost by up to 35% under 2× width expansion.
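
The expansion step itself can be made concrete. Below is a minimal sketch of copy-based width expansion for a two-layer MLP that keeps activation statistics unchanged at the moment of expansion, in the spirit of the paper's RMS-scale consistency; the function name and the exact rescaling are illustrative assumptions, not the paper's implementation.

    import torch

    @torch.no_grad()
    def widen_mlp(w_in: torch.Tensor, w_out: torch.Tensor, factor: int = 2):
        """Widen a two-layer MLP (h = act(w_in @ x); y = w_out @ h) by `factor`.

        Tiling the rows of w_in duplicates each hidden unit, so the RMS of
        the hidden activations is unchanged; tiling the columns of w_out and
        dividing by `factor` leaves the output y exactly unchanged.
        """
        w_in_new = w_in.repeat(factor, 1)             # (factor * d_h, d_in)
        w_out_new = w_out.repeat(1, factor) / factor  # (d_out, factor * d_h)
        return w_in_new, w_out_new

Because each new hidden unit is an exact copy of an existing one, it receives an identical gradient, which is precisely the gradient symmetry the abstract warns about; SPARKLING breaks that symmetry on the optimizer side rather than by perturbing the copied weights.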

Key Takeaways

  1. SPARKLING enables stable mid-stage width expansion.

  2. Saves up to 35% in training costs.

  3. Balances signal preservation and symmetry breaking (see the sketch after this list).
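
The abstract attributes symmetry breaking to asymmetric optimizer state resetting plus learning-rate re-warmup. The following is a sketch of one plausible reading, assuming an Adam-style optimizer whose moment tensors have already been expanded alongside the weights; the paper's exact reset rule and warmup schedule may differ.

    import torch

    @torch.no_grad()
    def asymmetric_state_reset(state: dict, d_old: int) -> None:
        """Zero Adam moments for the copied rows only (illustrative).

        Assumes state["exp_avg"] and state["exp_avg_sq"] were expanded
        along with the weight tensor to shape (factor * d_old, d_in).
        Keeping the original rows' moments while zeroing the copies'
        makes the two halves take different effective steps right after
        expansion, breaking the symmetry of exact copying.
        """
        for key in ("exp_avg", "exp_avg_sq"):
            state[key][d_old:].zero_()

    def rewarmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
        """Linear learning-rate re-warmup over `warmup_steps` steps
        after the expansion point."""
        return peak_lr * min(1.0, step / max(1, warmup_steps))

Resetting only the copies' moments, rather than the whole optimizer state, lets the original half keep its accumulated second-moment estimates, while the re-warmup keeps the first post-expansion updates small enough to avoid the loss spikes described in the abstract.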

Limitations

  • Relies on RMS-scale consistency assumption.

  • Requires careful hyperparameter tuning.

Keywords

progressive learning, width expansion, training instabilities, activation statistics, loss spikes, gradient symmetry, signal preservation, RMS-scale consistency, asymmetric optimizer state resetting, learning rate re-warmup, Mixture-of-Experts, computational savings
