Scaling Embeddings Outperforms Scaling Experts in Language Models

Hong Liu, Jiaqi Zhang, Chao Wang, Xing Hu, Linkun Lyu, Jiaqi Sun, Xurui Yang, Bo Wang, Fengcun Li, Yulei Qian, Lingtong Si, Yerui Sun, Rumei Li, Peng Pei, Yuchen Xie, Xunliang Cai

Published: January 29, 2026
Authors: 16
Word Count: 8,175

Embedding scaling beats MoE in large language models.

Abstract

While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through comprehensive analysis and experiments, we identify specific regimes in which embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy, ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B-parameter model with ~3B activated parameters, trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also remains highly competitive with existing models of comparable scale, particularly in agentic and coding domains.
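The abstract treats embeddings as a sparse scaling dimension: per token, only a few rows of a very large table are read, so total parameters grow while activated compute stays roughly flat. The minimal PyTorch sketch below illustrates that idea with a hashed bigram (n-gram) embedding table added on top of standard token embeddings; the class name, bucket count, and hashing scheme are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class HashedNgramEmbedding(nn.Module):
    """Sketch: standard token embeddings plus a large hashed bigram table.

    Only one extra row is read per token, so the added parameters
    contribute almost no activated compute (hypothetical design,
    not LongCat-Flash-Lite's actual embedding scheme).
    """

    def __init__(self, vocab_size: int, dim: int, ngram_buckets: int = 5_000_000):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        # Large sparse table: scales total parameters, not per-token FLOPs.
        self.bigram_emb = nn.Embedding(ngram_buckets, dim)
        self.ngram_buckets = ngram_buckets

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq) of token indices
        h = self.token_emb(token_ids)
        # Hash each (previous, current) token pair into a bucket id.
        prev = torch.roll(token_ids, shifts=1, dims=1)
        prev[:, 0] = 0  # no left context at the first position
        bigram_ids = (prev * 1_000_003 + token_ids) % self.ngram_buckets
        return h + self.bigram_emb(bigram_ids)

# Usage: lookup cost per token is two embedding reads, regardless of table size.
emb = HashedNgramEmbedding(vocab_size=32_000, dim=512)
out = emb(torch.randint(0, 32_000, (2, 16)))  # shape: (2, 16, 512)
```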

Key Takeaways

  1. Embedding scaling outperforms MoE in sparse, wide models.
  2. N-gram Embedding captures richer contextual information.
  3. Optimal performance comes from high sparsity and careful parameter budgeting.

Limitations

  • Requires substantial computational resources for training.

  • Effectiveness depends on careful hyperparameter tuning.

Keywords

Mixture-of-Experts, sparsity scaling, embedding scaling, Pareto frontier, parameter budgeting, model width, model depth, system optimizations, speculative decoding, LongCat-Flash-Lite