Large Language Models

ConceptMoE: Adaptive Token-to-Concept Compression for Implicit Compute Allocation

Zihao Huang, Jundong Zhou, Xingwei Qu, Qiyang Min, Ge Zhang
Published
January 29, 2026
Authors
5
Word Count
8,166
Code
Includes code

Adaptive concept-level processing boosts LLM efficiency.

Abstract

Large language models allocate uniform computation across all tokens, ignoring that some sequences are trivially predictable while others require deep reasoning. We introduce ConceptMoE, which dynamically merges semantically similar tokens into concept representations, performing implicit token-level compute allocation. A learnable chunk module identifies optimal boundaries by measuring inter-token similarity, compressing sequences by a target ratio R before they enter the compute-intensive concept model. Crucially, the MoE architecture enables controlled evaluation: we reallocate saved computation to match baseline activated FLOPs (excluding attention map computation) and total parameters, isolating genuine architectural benefits. Under these conditions, ConceptMoE consistently outperforms standard MoE across language and vision-language tasks, achieving +0.9 points on language pretraining, +2.3 points on long context understanding, and +0.6 points on multimodal benchmarks. When converting pretrained MoE during continual training with layer looping, gains reach +5.5 points, demonstrating practical applicability. Beyond performance, ConceptMoE reduces attention computation by up to R² times and KV cache by R times. At R=2, empirical measurements show prefill speedups reaching 175% and decoding speedups up to 117% on long sequences. The minimal architectural modifications enable straightforward integration into existing MoE models, demonstrating that adaptive concept-level processing fundamentally improves both effectiveness and efficiency of large language models.
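The abstract does not detail how the learnable chunk module draws boundaries beyond "measuring inter-token similarity," but the idea can be illustrated with a minimal greedy sketch: start a new concept chunk whenever the cosine similarity between adjacent token embeddings drops below a threshold, then mean-pool each chunk into one concept vector. The function name, the threshold, and mean pooling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def chunk_by_similarity(tokens: np.ndarray, threshold: float) -> np.ndarray:
    """Greedily group adjacent token embeddings into 'concept' chunks.

    A new chunk opens whenever cosine similarity between neighboring
    tokens falls below `threshold`; each concept is the mean of its chunk.
    (Illustrative stand-in for the paper's learnable chunk module.)
    """
    chunks = [[tokens[0]]]
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        sim = prev @ cur / (np.linalg.norm(prev) * np.linalg.norm(cur) + 1e-8)
        if sim >= threshold:
            chunks[-1].append(cur)   # similar enough: extend current concept
        else:
            chunks.append([cur])     # dissimilar: open a new concept chunk
    return np.stack([np.mean(c, axis=0) for c in chunks])

# Toy input: two runs of identical (hence maximally similar) embeddings.
tokens = np.array([[1, 0, 0, 0]] * 4 + [[0, 1, 0, 0]] * 4, dtype=float)
concepts = chunk_by_similarity(tokens, threshold=0.5)
print(f"{len(tokens)} tokens -> {len(concepts)} concepts")  # 8 tokens -> 2 concepts
```

Because self-attention cost grows quadratically with sequence length, feeding the compute-intensive concept model a sequence shortened by a factor of R cuts attention computation by roughly R² and the KV cache by R, which is the source of the savings the abstract reports.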

Key Takeaways

  1. Adaptive token-to-concept compression enhances model performance.

  2. Significant speedups in prefill and decoding stages.

  3. Dynamic compute allocation improves computational efficiency.

Limitations

  • Optimal compression ratio varies by dataset.

  • Requires additional computational resources for integration.

Keywords

ConceptMoE, token-level compute allocation, semantically similar tokens, concept representations, learnable chunk module, inter-token similarity, MoE architecture, attention computation, KV cache, layer looping, prefill speedups, decoding speedups
