Latest Speech & Audio AI Research Papers

Research on speech recognition, text-to-speech, audio processing, and voice AI technologies.

5 Papers

Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis

Thanathai Lertpetchpun, Yoonjeong Lee, Thanapat Trachu +4 more

Many spoken languages, including English, exhibit wide variation in dialects and accents, making accent control an important capability for flexible text-to-speech (TTS) models. Current TTS systems typically generate accented speech by conditioning on speaker embeddings associated with specific acce...

text-to-speech, speaker embeddings, phonological rules, accent control, phoneme shift rate, +3 more
Jan 20, 2026 | 5

PRiSM: Benchmarking Phone Realization in Speech Models

Shikhar Bharadwaj, Chin-Jou Li, Yoonjae Kim +13 more

Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. Despite prolonged efforts in developing PR systems, current evaluations only measure surface-level transcription accuracy. We introduce PRiSM, the first ope...

phone recognition, cross-lingual speech processing, phonetic analysis, intrinsic evaluation, extrinsic evaluation, +10 more
Jan 20, 2026 | 5

Typhoon ASR Real-time: FastConformer-Transducer for Thai Automatic Speech Recognition

Warit Sirichotedumrong, Adisai Na-Thalang, Potsawee Manakul +3 more

Large encoder-decoder models like Whisper achieve strong offline transcription but remain impractical for streaming applications due to high latency. However, due to the accessibility of pre-trained checkpoints, the open Thai ASR landscape remains dominated by these offline architectures, leaving a ...

FastConformer-Transducer, streaming applications, text normalization, curriculum learning, Thai ASR, +5 more
Jan 19, 2026 | 11

FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning

Tanyu Chen, Tairan Chen, Kai Shen +4 more

Recent end-to-end spoken dialogue systems leverage speech tokenizers and neural audio codecs to enable LLMs to operate directly on discrete speech representations. However, these models often exhibit limited speaker identity preservation, hindering personalized voice interaction. In this work, we pr...

speech tokenizers, neural audio codecs, LLMs, end-to-end spoken dialogue systems, voice cloning, +5 more
Jan 16, 2026 | 18