Speech & Audio AI

MAEB: Massive Audio Embedding Benchmark

Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha, Rahul Chand, Silky Singh, Kaitlyn Wang, Ali Sartaz Khan, Marc Moussa Nasser, Sufen Fong, Pengfei He, Alan Xiao, Ayush Sunil Munot, Aditya Shrivastava, Artem Gazizov, Niklas Muennighoff, Kenneth Enevoldsen
Published
February 17, 2026

Abstract

We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. MAEB is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.
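
Because MAEB integrates into the MTEB ecosystem, evaluation should follow the standard mteb Python API. Below is a minimal sketch, assuming the benchmark is registered in mteb under the name "MAEB"; the model identifier is a hypothetical placeholder, not a model evaluated in the paper.

```python
import mteb

# Fetch the MAEB task collection from the mteb registry
# (assumes the benchmark is registered under the name "MAEB").
benchmark = mteb.get_benchmark("MAEB")

# Load an embedding model known to mteb; the identifier below is
# a hypothetical placeholder.
model = mteb.get_model("my-org/audio-embedding-model")

# Run every task in the benchmark and write scores to disk.
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model, output_folder="results/maeb")
```

Results written this way follow the same layout used across the text and image MTEB benchmarks, which is what enables the unified leaderboard mentioned above.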

Keywords

audio embedding benchmark, audio-text reasoning, multilingual speech tasks, acoustic understanding, linguistic tasks, audio encoders, audio large language models, MTEB ecosystem
