Voxtral Realtime

Alexander H. Liu, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, Rohin Arora, Sanchit Gandhi, Sandeep Subramanian, Soham Ghosh, Srijan Mishra, Abhinav Rastogi, Alan Jeffares, Albert Jiang, Alexandre Sablayrolles, Amélie Héliou, Andrew Bai, Angele Lenglemetz, Anmol Agarwal, Anton Eliseev, Antonia Calvi, Arjun Majumdar, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Benjamin Tibi, Clémence Lanfranchi, Connor Chen, Corentin Barreau, Corentin Sautier, Cyprien Courtot, Darius Dabert, Diego de las Casas, Elliot Chane-Sane, Enguerrand Paquin, Faruk Ahmed, Federico Baldassarre, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Genevieve Hayes, Georgii Novikov, Giada Pistilli, Guillaume Martin, Gunjan Dhanuka, Gunshi Gupta, Han Zhou, Indraneel Mukherjee, Irene Zhang, Jaeyoung Kim, Jan Ludziejewski, Jason Rute, Joachim Studnia, John Harvill, Jonas Amar, Josselin Somerville Roberts, Julien Tauran, Karmesh Yadav, Kartik Khandelwal, Kush Jain, Laurence Aitchison, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Maarten Buyl, Manan Sharma, Margaret Jennings, Marie Pellat, Mark Prins, Mathieu Poirée, Mathilde Guillaumin, Matthieu Dinot, Matthieu Futeral, Maxime Darrin, Maximilian Augustin, Mert Unsal, Mia Chiquier, Nathan Grinsztajn, Neha Gupta, Olivier Bousquet, Olivier Duchenne, Patricia Wang, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Philomène Chagniot, Pierre Stock, Piotr Miłoś, Prateek Gupta, Pravesh Agrawal, Quentin Torroba, Ram Ramrakhya, Rishi Shah, Romain Sauvestre, Roman Soletskyi, Rosalie Millner, Sagar Vaze, Samuel Humeau, Siddharth Gandhi, Sumukh Aithal, Szymon Antoniak, Teven Le Scao, Théo Cachet, Theo Simon Sorg, Thibaut Lavril, Thomas Chabal, Thomas Foubert, Thomas Robert, Thomas Wang, Tim Lawson, Tom Bewley, Tom Edwards, Tyler Wang, Valeriia Nemychnikova, Van Phung, Vedant Nanda, Victor Jouault, Virgile Richard, Vladislav Bataev, Wassim Bouaziz, Wen-Ding Li, William Marshall, Xinghui Li, Xingran Guo, Xinyu Yang, Yannic Neuhaus, Yihan Wang, Zaccharie Ramzi, Zhenlin Xu

Published: February 11, 2026
Authors: 134
Word Count: 7,532


Mistral's Voxtral Realtime delivers Whisper-quality speech recognition with sub-second streaming latency.

Abstract

We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.
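The abstract names Ada RMS-Norm as the mechanism for delay conditioning but this summary does not give its formula. As a rough illustration of the general idea only, the sketch below applies a standard RMS normalisation and then scales each channel by a gain predicted from a delay embedding. The function names, the `1 + W·d` gain form, and the shape of `weight` are assumptions made for this example, not the authors' formulation.

```python
import math


def rms_norm(x, eps=1e-6):
    """Rescale a vector so that its root-mean-square is (approximately) 1."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def ada_rms_norm(x, delay_embedding, weight, eps=1e-6):
    """RMS-normalise x, then apply a per-channel gain that depends on the delay.

    Assumption: the gain is 1 + weight @ delay_embedding, where `weight` is a
    hypothetical learned matrix with one row per channel of x. With all-zero
    weights this reduces to plain RMS normalisation.
    """
    normed = rms_norm(x, eps)
    gain = [
        1.0 + sum(w * d for w, d in zip(row, delay_embedding))
        for row in weight
    ]
    return [g * v for g, v in zip(gain, normed)]


# Toy usage: a 2-channel activation and a 1-dimensional delay embedding.
base = rms_norm([3.0, 4.0])
conditioned = ada_rms_norm([3.0, 4.0], [1.0], [[0.5], [-0.5]])
```

In this toy setup the first channel is amplified and the second attenuated for the given delay embedding, which is the sense in which a single set of weights could behave differently at different target delays.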

Key Takeaways

1. Voxtral Realtime achieves offline-quality speech recognition at a 480ms delay using a natively streaming architecture.
2. Delayed Streams Modeling with explicit audio-text alignment lets the model emit each token only once the acoustic evidence for it has arrived.
3. A causal encoder built from modern architectural components outperforms adapted offline models for real-time transcription.
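To make the alignment idea in the takeaways concrete, here is a minimal sketch of how a delayed text stream could be laid out against audio frames. It assumes an 80 ms frame duration (so 480 ms corresponds to 6 frames), anchors each word at the frame where it ends, and uses a `<pad>` filler token. None of these specifics are stated in this summary; they are illustrative choices.

```python
FRAME_MS = 80   # assumed frame duration; not given in this summary
PAD = "<pad>"   # hypothetical filler token for frames with no text


def delay_in_frames(delay_ms, frame_ms=FRAME_MS):
    """Convert a delay in milliseconds to a whole number of frames."""
    if delay_ms % frame_ms != 0:
        raise ValueError("delay must be a multiple of the frame duration")
    return delay_ms // frame_ms


def build_text_stream(word_end_times_ms, num_frames, delay_ms, frame_ms=FRAME_MS):
    """Build a per-frame text stream in which each word is shifted by the delay.

    word_end_times_ms is a list of (word, end_time_ms) pairs. A word is placed
    at the frame containing its end time plus the delay, so the model has heard
    `delay_ms` of audio after the word before it must emit it.
    """
    shift = delay_in_frames(delay_ms, frame_ms)
    stream = [PAD] * num_frames
    for word, end_ms in word_end_times_ms:
        frame = end_ms // frame_ms + shift
        if frame >= num_frames:
            continue  # the word would be emitted after the stream ends
        # Words that land on the same frame are joined rather than dropped.
        stream[frame] = word if stream[frame] == PAD else stream[frame] + " " + word
    return stream


words = [("hello", 240), ("world", 560)]
stream = build_text_stream(words, num_frames=16, delay_ms=480)
# "hello" ends in frame 3 and is emitted at frame 9; "world" moves from 7 to 13.
```

A larger delay gives the model more acoustic context after each word at the cost of later output, which is the latency/quality trade-off the takeaways describe.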

Limitations

Streaming models cannot see future audio context, unlike offline systems that process entire signals.
Adapting traditional offline architectures to sub-second latency targets causes severe degradation in transcription quality.
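The first limitation can be pictured as a difference in attention masks. The sketch below is a generic illustration, not the model's actual encoder: it builds a boolean mask in which an offline model lets every frame attend to the whole signal, while a causal (streaming) model only lets a frame attend to itself and earlier frames. The function name and interface are hypothetical.

```python
def attention_mask(num_frames, causal):
    """Return mask[i][j] == True when frame i is allowed to attend to frame j."""
    return [
        [(j <= i) if causal else True for j in range(num_frames)]
        for i in range(num_frames)
    ]


offline = attention_mask(3, causal=False)    # every frame sees all frames
streaming = attention_mask(3, causal=True)   # frame i sees frames 0..i only
```

With three frames, the first streaming frame can use only itself, whereas the first offline frame can already use all three; a configurable output delay is one way to give a streaming model some of that missing right-hand context.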

Keywords

automatic speech recognition, streaming, end-to-end training, causal audio encoder, Ada RMS-Norm, Delayed Streams Modeling, pretraining, large-scale dataset, latency, alignment
