
MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

MiniCPM Team, Wenhao An, Yingfa Chen, Yewei Fang, Jiayi Li, Xin Li, Yaohui Li, Yishan Li, Yuxuan Li, Biyuan Lin, Chuan Liu, Hezi Liu, Siyuan Liu, Hongya Lyu, Yinxu Pan, Shixin Ren, Xingyu Shen, Zhou Su, Haojun Sun, Yangang Sun, Zhen Leng Thai, Xin Tian, Rui Wang, Xiaorong Wang, Yudong Wang, Bo Wu, Xiaoyue Xu, Dong Xu, Shuaikang Xue, Jiawei Yang, Bowen Zhang, Jinqian Zhang, Letian Zhang, Shengnan Zhang, Xinyu Zhang, Xinyuan Zhang, Zhu Zhang, Hengyu Zhao, Jiacheng Zhao, Jie Zhou, Zihan Zhou, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu, Maosong Sun
Published: February 12, 2026
Authors: 46
Word Count: 7,099

Hybrid sparse-linear attention enables efficient long-context LLM processing without sacrificing accuracy.

Abstract

The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, reducing training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at a sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale where traditional full-attention 8B models fail because of memory constraints.
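
The 1:3 sparse-to-linear ratio can be pictured as a simple layer-assignment pattern. The sketch below is a minimal illustration, assuming a uniform rule in which every fourth decoder layer uses sparse attention and the rest use linear attention; the names, the layer count, and the selection rule are illustrative assumptions, not the paper's layer selection algorithm.

```python
# Minimal sketch of a hybrid layer pattern with a 1:3 sparse-to-linear ratio.
# The selection rule and names below are illustrative placeholders, not
# MiniCPM-SALA's actual layer selection algorithm.
from dataclasses import dataclass


@dataclass
class LayerSpec:
    index: int
    attention: str  # "sparse" (InfLLM-V2-style) or "linear" (Lightning-Attention-style)


def build_layer_pattern(num_layers: int, sparse_every: int = 4) -> list[LayerSpec]:
    """Assign sparse attention to one layer out of every `sparse_every` layers."""
    return [
        LayerSpec(
            index=i,
            attention="sparse" if i % sparse_every == sparse_every - 1 else "linear",
        )
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    layers = build_layer_pattern(num_layers=32)  # layer count is an assumption
    num_sparse = sum(spec.attention == "sparse" for spec in layers)
    print(f"{num_sparse}/{len(layers)} layers use sparse attention")  # 8/32 -> 25/75 split
```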

Key Takeaways

  1. MiniCPM-SALA combines sparse and linear attention in a 25/75 ratio to balance precision with efficiency.

  2. Hybrid positional encoding removes rotary embeddings from sparse layers to prevent long-distance information decay (see the sketch after this list).

  3. The hybrid approach resolves the "sparse computation, dense storage" mismatch of sparse attention while maintaining contextual fidelity.
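
The following is a minimal sketch of the hybrid positional encoding idea, assuming HyPE amounts to skipping rotary position embeddings in sparse-attention layers while the remaining layers keep them; the RoPE formulation and function names here are illustrative, not the paper's implementation.

```python
# Toy illustration of hybrid positional encoding (HyPE): rotary embeddings are
# skipped in sparse-attention layers so that distant tokens are not attenuated,
# while other layers apply a standard RoPE rotation. Illustrative only.
import torch


def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard rotary position embedding over the last dimension (must be even)."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos().to(x.dtype), angles.sin().to(x.dtype)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


def encode(x: torch.Tensor, layer_is_sparse: bool) -> torch.Tensor:
    """Leave sparse-attention layers position-free; rotate queries/keys elsewhere."""
    return x if layer_is_sparse else apply_rope(x)


q = torch.randn(2, 8, 1024, 128)             # (batch, heads, seq_len, head_dim)
q_sparse = encode(q, layer_is_sparse=True)   # unchanged: no rotary encoding
q_other = encode(q, layer_is_sparse=False)   # rotary encoding applied
```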

Limitations

  • Linear attention compresses contextual information lossily, so some context is inevitably lost despite the efficiency gains.

  • Sparse attention requires full KV-cache storage despite computing only partial attention matrices (a back-of-the-envelope calculation of this cost follows the list).
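
As a rough illustration of the KV-cache constraint, the calculation below estimates the cache footprint of a full-attention model at long context lengths. The configuration (32 layers, 8 KV heads, head dimension 128, bf16) is an assumed typical ~8B setup, not MiniCPM-SALA's published configuration.

```python
# Back-of-the-envelope KV-cache size for a full-attention model with an assumed
# typical ~8B configuration; shows why dense KV storage dominates at long contexts.
def kv_cache_gib(seq_len: int, layers: int = 32, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """GiB needed to store keys and values for `seq_len` tokens across all layers."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # factor 2: K and V
    return seq_len * per_token / 1024**3


print(f"256K tokens: {kv_cache_gib(256 * 1024):.0f} GiB")   # ~32 GiB
print(f"1M tokens:   {kv_cache_gib(1024 * 1024):.0f} GiB")  # ~128 GiB, well beyond a single workstation GPU
```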

Keywords

large language models, Transformer architecture, sparse attention, linear attention, hybrid architecture, layer selection algorithm, hybrid positional encoding, continual training framework, inference speed, sequence length, token context
