
Fish Audio S2 Technical Report

Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei Han
Published: March 9, 2026
Authors: 14
Word Count: 10,366

Fish Audio S2 combines unified data pipelines with dual-autoregressive generation for controllable, expressive text-to-speech.

Abstract

We introduce Fish Audio S2, an open-source text-to-speech system featuring multi-speaker, multi-turn generation and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning, speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving an RTF of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We highly encourage readers to visit https://fish.audio to try custom voices.
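For readers unfamiliar with the streaming metrics quoted above: real-time factor (RTF) is synthesis time divided by the duration of the audio produced, so values below 1.0 mean the engine generates audio faster than it plays back. A minimal sketch of what the reported RTF of 0.195 implies (the helper function here is illustrative, not part of the released engine):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of audio produced.

    RTF < 1.0 means audio is generated faster than real-time playback,
    which is a prerequisite for gapless streaming.
    """
    return synthesis_seconds / audio_seconds

# At the reported RTF of 0.195, 10 s of audio takes roughly 1.95 s to generate.
audio_seconds = 10.0
synthesis_seconds = 0.195 * audio_seconds
print(real_time_factor(synthesis_seconds, audio_seconds))  # ~0.195
```

Combined with a time-to-first-audio under 100 ms, this means playback can begin almost immediately while generation stays comfortably ahead of the playhead.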

Key Takeaways

  1. Fish Audio S2 uses a unified data pipeline where quality assessment and ASR models serve dual purposes: filtering training data and providing RL reward signals, eliminating distribution mismatch.

  2. The Dual-Autoregressive architecture separates linguistic planning via a Slow AR from acoustic detail generation via a Fast AR, solving the dimensionality problem of ten stacked codebooks.

  3. Fish Audio S2 achieves superior instruction-following, multi-speaker multi-turn generation, and stable long-form synthesis with production-ready streaming at 0.195 RTF and sub-100 ms latency.
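The Dual-AR split in the second takeaway can be sketched in a few lines. This is an illustrative stand-in, not the released model: the function names, token values, and codebook size are assumptions; only the two-axis structure (one slow step per frame, one fast inner loop over the ten stacked codebooks) reflects the takeaway.

```python
import random

NUM_CODEBOOKS = 10  # stacked residual codebooks, per the takeaway above

def slow_ar_step(prior_frames):
    """Illustrative Slow AR: plans one coarse/semantic token per frame,
    conditioned on everything generated so far."""
    return random.randrange(1024)

def fast_ar_expand(semantic_token):
    """Illustrative Fast AR: fills the ten codebook slots for one frame,
    each slot conditioned on the semantic token and the slots before it."""
    codes = []
    for depth in range(NUM_CODEBOOKS):
        # Placeholder acoustic code; a real model would sample from a
        # depth-wise autoregressive distribution here.
        codes.append((semantic_token + depth * 31) % 1024)
    return codes

def generate(num_frames):
    """Outer (slow) loop over frames, inner (fast) loop over codebooks."""
    frames = []
    for _ in range(num_frames):
        sem = slow_ar_step(frames)          # linguistic planning axis
        frames.append(fast_ar_expand(sem))  # acoustic detail axis
    return frames

frames = generate(5)
assert all(len(frame) == NUM_CODEBOOKS for frame in frames)
```

The point of the split is that the slow model never has to predict a 10-way joint token per step; the fast model amortizes that dimensionality across a short inner sequence.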

Limitations

  • The paper mentions that generating fine-grained natural-language instructions for vocal features at scale remains a major bottleneck in TTS development.

  • The source transcript cuts off mid-sentence during the Fast AR architecture explanation, so not all technical limitations discussed could be fully assessed.

Keywords

text-to-speech, multi-speaker, multi-turn generation, instruction-following control, natural-language descriptions, multi-stage training, staged data pipeline, video captioning, speech captioning, voice-quality assessment, reward modeling, SGLang-based inference engine, RTF, time-to-first-audio
