Speech & Audio AI

Voxtral Realtime

Alexander H. Liu, Andy Ehrenberg, Andy Lo, Chen-Yo Sun, Guillaume Lample, Jean-Malo Delignon, Khyathi Raghavi Chandu, Patrick von Platen, Pavankumar Reddy Muddireddy, Rohin Arora, Sanchit Gandhi, Sandeep Subramanian, Soham Ghosh, Srijan Mishra, Abhinav Rastogi, Alan Jeffares, Albert Jiang, Alexandre Sablayrolles, Amélie Héliou, Andrew Bai, Angele Lenglemetz, Anmol Agarwal, Anton Eliseev, Antonia Calvi, Arjun Majumdar, Baptiste Bout, Baptiste Rozière, Baudouin De Monicault, Benjamin Tibi, Clémence Lanfranchi, Connor Chen, Corentin Barreau, Corentin Sautier, Cyprien Courtot, Darius Dabert, Diego de las Casas, Elliot Chane-Sane, Enguerrand Paquin, Faruk Ahmed, Federico Baldassarre, Gabrielle Berrada, Gaëtan Ecrepont, Gauthier Guinet, Genevieve Hayes, Georgii Novikov, Giada Pistilli, Guillaume Martin, Gunjan Dhanuka, Gunshi Gupta, Han Zhou, Indraneel Mukherjee, Irene Zhang, Jaeyoung Kim, Jan Ludziejewski, Jason Rute, Joachim Studnia, John Harvill, Jonas Amar, Josselin Somerville Roberts, Julien Tauran, Karmesh Yadav, Kartik Khandelwal, Kush Jain, Laurence Aitchison, Léonard Blier, Lingxiao Zhao, Louis Martin, Lucile Saulnier, Luyu Gao, Maarten Buyl, Manan Sharma, Margaret Jennings, Marie Pellat, Mark Prins, Mathieu Poirée, Mathilde Guillaumin, Matthieu Dinot, Matthieu Futeral, Maxime Darrin, Maximilian Augustin, Mert Unsal, Mia Chiquier, Nathan Grinsztajn, Neha Gupta, Olivier Bousquet, Olivier Duchenne, Patricia Wang, Paul Jacob, Paul Wambergue, Paula Kurylowicz, Philomène Chagniot, Pierre Stock, Piotr Miłoś, Prateek Gupta, Pravesh Agrawal, Quentin Torroba, Ram Ramrakhya, Rishi Shah, Romain Sauvestre, Roman Soletskyi, Rosalie Millner, Sagar Vaze, Samuel Humeau, Siddharth Gandhi, Sumukh Aithal, Szymon Antoniak, Teven Le Scao, Théo Cachet, Theo Simon Sorg, Thibaut Lavril, Thomas Chabal, Thomas Foubert, Thomas Robert, Thomas Wang, Tim Lawson, Tom Bewley, Tom Edwards, Tyler Wang, Valeriia Nemychnikova, Van Phung, Vedant Nanda, Victor Jouault, Virgile Richard, Vladislav Bataev, Wassim Bouaziz, Wen-Ding Li, William Marshall, Xinghui Li, Xingran Guo, Xinyu Yang, Yannic Neuhaus, Yihan Wang, Zaccharie Ramzi, Zhenlin Xu
Published: February 11, 2026
Authors: 134
Word Count: 7,532

Mistral's Voxtral Realtime delivers Whisper-quality speech recognition with sub-second streaming latency.

Abstract

We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.
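To make the idea of explicit alignment between audio and text streams more concrete, the following is a minimal sketch of how a delayed text stream could be laid out against audio frames. It is an illustration only, not the paper's implementation: the 80 ms frame duration, the `[PAD]` token, the word-level end timestamps, and the helper name `build_delayed_text_stream` are all assumptions introduced here. Only the 480 ms delay comes from the abstract.

```python
import math

FRAME_MS = 80   # assumed audio frame duration (not stated in the source)
DELAY_MS = 480  # delay reported in the abstract
PAD = "[PAD]"   # hypothetical token emitted when there is nothing to transcribe


def build_delayed_text_stream(words, num_frames, frame_ms=FRAME_MS, delay_ms=DELAY_MS):
    """Return one text token per audio frame.

    words: list of (word, end_time_ms) pairs in spoken order.
    Each word is placed `delay_ms` after the frame in which it ends, so the
    model has heard the whole word (plus some right context) before emitting it.
    """
    delay_frames = math.ceil(delay_ms / frame_ms)
    stream = [PAD] * num_frames
    for word, end_ms in words:
        spoken_frame = int(end_ms // frame_ms)  # frame in which the word ends
        slot = spoken_frame + delay_frames      # frame in which it is emitted
        # If the slot is taken, move to the next free frame to preserve order.
        while slot < num_frames and stream[slot] != PAD:
            slot += 1
        if slot < num_frames:                   # words past the end are dropped
            stream[slot] = word
    return stream


words = [("hello", 300), ("world", 700)]
stream = build_delayed_text_stream(words, num_frames=20)
print([(i, tok) for i, tok in enumerate(stream) if tok != PAD])
# [(9, 'hello'), (14, 'world')]
```

Under these assumptions the delay corresponds to six frames, so a word that ends in frame 3 is emitted at frame 9. A larger delay gives the model more acoustic evidence per token at the cost of latency.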

Key Takeaways

  1. Voxtral Realtime achieves offline-quality speech recognition at 480ms latency using a natively streaming architecture.

  2. Delayed Streams Modeling with explicit audio-text alignment lets the model emit tokens based on acoustic evidence.

  3. A causal encoder with modern architectural components outperforms adapted offline models for real-time transcription.

Limitations

  • Streaming models cannot see future audio context, unlike offline systems that process entire signals.

  • Adapting traditional offline architectures to sub-second latency targets severely degrades their transcription quality.
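The first limitation can be pictured as an attention mask. The sketch below is a generic illustration of causal versus full (offline) visibility over audio frames, not a description of the Voxtral Realtime encoder; the function name `attention_mask` and the frame count are hypothetical.

```python
def attention_mask(num_frames, causal):
    """mask[i][j] is True when frame i is allowed to attend to frame j.

    causal=False: every frame sees the whole signal (offline setting).
    causal=True:  frame i sees only frames 0..i (streaming setting).
    """
    return [
        [(j <= i) or not causal for j in range(num_frames)]
        for i in range(num_frames)
    ]


offline = attention_mask(4, causal=False)
streaming = attention_mask(4, causal=True)

# Number of frames visible from frame 1 in each setting.
print(sum(offline[1]), sum(streaming[1]))
# 4 2
```

In the offline mask, an early frame can draw on everything that follows it; in the causal mask it cannot, which is why a streaming model has to commit to tokens with less context.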

Keywords

automatic speech recognition, streaming, end-to-end training, causal audio encoder, Ada RMS-Norm, Delayed Streams Modeling, pretraining, large-scale dataset, latency, alignment
