Large Language Models

DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

Haolei Bai, Lingcheng Kong, Xueyi Chen, Jianmian Wang, Zhiqiang Tao, Huan Wang
Published: February 12, 2026
Authors: 6
Word Count: 8,574
Code: Included

Diffusion models outperform traditional autoregressive approaches at CUDA kernel generation.

Abstract

Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well-suited for code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs for CUDA kernel generation remains challenging, hindered not only by the high degree of domain specialization required but also by a severe lack of high-quality training data. To address these challenges, we construct CuKe, an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. Building on this dataset, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Leveraging this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation, spanning three parameter scales: 1.7B, 4B, and 8B. Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both autoregressive and diffusion LLMs of comparable scale, establishing a new state-of-the-art for CUDA kernel generation.

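The parallel, non-sequential generation the abstract attributes to dLLMs can be illustrated with a toy MaskGIT-style unmasking loop. The vocabulary and confidence function below are placeholders standing in for a real diffusion decoder, not the paper's actual model:

```python
import random

MASK = "<mask>"
VOCAB = ["a", "b", "c"]  # toy vocabulary; a real dLLM scores full token vocabularies

def toy_confidence(pos, token):
    # Stand-in for a dLLM's per-position confidence; a real model would
    # score every masked position from the full bidirectional context.
    random.seed(hash((pos, token)))
    return random.random()

def parallel_decode(length, steps=4):
    """Start fully masked, then iteratively commit the most confident
    tokens in parallel -- a sketch of non-sequential refinement."""
    seq = [MASK] * length
    per_step = max(1, length // steps)
    while MASK in seq:
        masked = [i for i, t in enumerate(seq) if t == MASK]
        # propose a token for every masked slot at once
        proposals = {i: max(VOCAB, key=lambda t: toy_confidence(i, t)) for i in masked}
        # commit only the highest-confidence positions this round
        best = sorted(masked, key=lambda i: toy_confidence(i, proposals[i]), reverse=True)
        for i in best[:per_step]:
            seq[i] = proposals[i]
    return seq
```

Unlike left-to-right AR decoding, each round can fill positions anywhere in the sequence, which is the property the abstract argues suits structured code.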
Key Takeaways

  1. Diffusion models outperform autoregressive models for CUDA kernel generation through bidirectional refinement.

  2. The CuKe dataset filters kernels by a 2x speedup threshold, ensuring training on genuinely optimized code.

  3. The BiC-RL framework prevents deceptive behavior, where models generate syntactically correct but non-functional kernels.

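The 2x speedup filter mentioned above can be sketched as a simple benchmarking gate. The paper's harness presumably times compiled CUDA kernels (e.g. with CUDA events); the pure-Python timing below is illustrative only, and `candidate`/`baseline` are hypothetical stand-ins for kernel launchers:

```python
import time

def benchmark(fn, *args, iters=50):
    """Median wall-clock time of fn over several runs; a real pipeline
    would use CUDA event timing on warmed-up, compiled kernels."""
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

def passes_speedup_filter(candidate, baseline, args=(), threshold=2.0):
    """Keep a candidate kernel only if it is at least `threshold`x faster
    than the reference implementation (the 2x rule described above)."""
    return benchmark(baseline, *args) / benchmark(candidate, *args) >= threshold
```

Filtering on measured speedup rather than mere correctness keeps only genuinely optimized kernels in the training set, which is the stated goal of the CuKe construction.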
Limitations

  • CuKe dataset contains only 6,303 samples, small compared to modern training datasets.

  • Existing CUDA kernel datasets are tiny and messy, providing minimal high-quality training data.

Keywords

diffusion large language models, autoregressive LLMs, parallel token generation, CUDA kernel generation, supervised fine-tuning, reinforcement learning, bi-phase curated reinforcement learning, kernel generation, KernelBench
