
Fish Audio S2 Technical Report

Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei Han
Published: March 9, 2026
Authors: 14
Word Count: 10,366

Fish Audio S2 combines unified data pipelines with dual-autoregressive generation for controllable, expressive text-to-speech.

Abstract

We introduce Fish Audio S2, an open-source text-to-speech system featuring multi-speaker, multi-turn generation and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning, speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving an RTF of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We highly encourage readers to visit https://fish.audio to try custom voices.
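For readers unfamiliar with the streaming metrics quoted above: real-time factor (RTF) is synthesis time divided by the duration of the audio produced, so values below 1.0 mean the engine generates audio faster than it plays back. A minimal sketch of what the reported RTF of 0.195 implies (the helper function here is illustrative, not part of the released engine):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced.

    RTF < 1.0 means audio is generated faster than real-time playback,
    which is a prerequisite for gapless streaming.
    """
    return synthesis_seconds / audio_seconds

# At the reported RTF of 0.195, 10 s of audio takes roughly 1.95 s to generate.
audio_seconds = 10.0
synthesis_seconds = 0.195 * audio_seconds
print(real_time_factor(synthesis_seconds, audio_seconds))  # ~0.195
```

Combined with a time-to-first-audio under 100 ms, this means playback can begin almost immediately while generation stays comfortably ahead of the playhead.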

Key Takeaways

  1. Fish Audio S2 uses a unified data pipeline where quality assessment and ASR models serve dual purposes: filtering training data and providing RL reward signals, eliminating distribution mismatch.

  2. The Dual-Autoregressive architecture separates linguistic planning via a Slow AR from acoustic detail generation via a Fast AR, solving the dimensionality problem of ten stacked codebooks.

  3. Fish Audio S2 achieves superior instruction-following, multi-speaker multi-turn generation, and stable long-form synthesis with production-ready streaming at 0.195 RTF and sub-100 ms latency.
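The Dual-AR split in the second takeaway can be sketched in a few lines. This is an illustrative stand-in, not the released model: the function names, token values, and codebook size are assumptions; only the two-axis structure (one slow step per frame, one fast inner loop over the ten stacked codebooks) reflects the takeaway.

```python
import random

NUM_CODEBOOKS = 10  # stacked residual codebooks, per the takeaway above

def slow_ar_step(prior_frames):
    """Illustrative Slow AR: plans one coarse/semantic token per frame,
    conditioned on everything generated so far."""
    return random.randrange(1024)

def fast_ar_expand(semantic_token):
    """Illustrative Fast AR: fills the ten codebook slots for one frame,
    each slot conditioned on the semantic token and the slots before it."""
    codes = []
    for depth in range(NUM_CODEBOOKS):
        # Placeholder acoustic code; a real model would sample from a
        # depth-wise autoregressive distribution here.
        codes.append((semantic_token + depth * 31) % 1024)
    return codes

def generate(num_frames):
    """Outer (slow) loop over frames, inner (fast) loop over codebooks."""
    frames = []
    for _ in range(num_frames):
        sem = slow_ar_step(frames)          # linguistic planning axis
        frames.append(fast_ar_expand(sem))  # acoustic detail axis
    return frames

frames = generate(5)
assert all(len(frame) == NUM_CODEBOOKS for frame in frames)
```

The point of the split is that the slow model never has to predict a 10-way joint token per step; the fast model amortizes that dimensionality across a short inner sequence.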

Limitations

  • The paper mentions that generating fine-grained natural-language instructions for vocal features at scale remains a major bottleneck in TTS development.

  • The source transcript cuts off mid-sentence during the Fast AR architecture explanation, so not all technical limitations discussed could be fully assessed.

Keywords

text-to-speech, multi-speaker, multi-turn generation, instruction-following control, natural-language descriptions, multi-stage training, staged data pipeline, video captioning, speech captioning, voice-quality assessment, reward modeling, SGLang-based inference engine, RTF, time-to-first-audio
