Large Language Models

MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration

Lianhai Ren, Yucheng Ding, Xiao Liu, Qianxiao Li, Peng Cheng, Yeyun Gong
Published: February 2, 2026
Authors: 6
Word count: 9,430
Code: included

MSign stabilizes LLM training via periodic matrix resets.

Abstract

Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via μP, identifying two key phenomena preceding collapse: (1) rapid decline in weight matrix stable rank (ratio of squared Frobenius norm to squared spectral norm), and (2) increasing alignment between adjacent layer Jacobians. We prove theoretically that these two conditions jointly cause exponential gradient norm growth with network depth. To break this instability mechanism, we propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank. Experiments on models from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than 7.0%.
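The abstract defines stable rank as the squared Frobenius norm divided by the squared spectral norm of a weight matrix. As a minimal sketch (not the paper's code; the matrix shapes and seed are illustrative), this quantity can be computed directly with NumPy. It equals the matrix dimension for an orthogonal matrix and collapses to 1 for a rank-1 matrix, which is the kind of decline the paper observes before training failures:

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank of W: ||W||_F^2 / ||W||_2^2
    (squared Frobenius norm over squared spectral norm)."""
    fro_sq = np.sum(W ** 2)
    spec = np.linalg.norm(W, ord=2)  # largest singular value
    return float(fro_sq / spec ** 2)

rng = np.random.default_rng(0)

# Orthogonal-like extreme: the identity has stable rank equal to its dimension.
W_full = np.eye(64)

# Collapsed extreme: a rank-1 outer product has stable rank exactly 1.
u = rng.standard_normal((64, 1))
W_rank1 = u @ u.T
```

A drop of this ratio toward 1 means a single singular value dominates the matrix, which is the precondition (together with Jacobian alignment across layers) that the paper links to exponential gradient growth with depth.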

Key Takeaways

  1. MSign optimizer prevents training instability in LLMs.

  2. Matrix sign operations restore stable rank of weights.

  3. MSign maintains stable convergence across various model sizes.
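The matrix sign operation mentioned in the takeaways maps every nonzero singular value of a matrix toward 1, which restores stable rank to (nearly) the full dimension. A common way to approximate it without an explicit SVD is a Newton-Schulz iteration; the sketch below is a hypothetical illustration of that idea, not the paper's actual MSign update rule, and the step count and test matrix are assumptions:

```python
import numpy as np

def msign(W: np.ndarray, steps: int = 10) -> np.ndarray:
    """Approximate the semi-orthogonal (polar) factor of W via
    Newton-Schulz iteration: X <- 1.5*X - 0.5*X @ X.T @ X.
    Drives all nonzero singular values toward 1, raising stable rank.
    Illustrative sketch only; the paper's exact procedure may differ."""
    X = W / np.linalg.norm(W, ord=2)  # scale so singular values lie in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X
```

Because the iteration uses only matrix multiplications, it is GPU-friendly, which is consistent with the paper's claim that applying the operation periodically costs under 7.0% overhead.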

Limitations

  • Optimal frequency of MSign application requires tuning.

  • Computational overhead of periodic matrix operations.

Keywords

large language model, pretraining, gradient explosions, weight matrix stable rank, Frobenius norm, spectral norm, Jacobian, matrix sign operations, optimizer
