Salama1429/s2-pro

@@ -91,6 +91,8 @@ tags:
 - text-to-speech
 - instruction-following
 - multilingual
 inference: false
 extra_gated_prompt: You agree to not use the model to generate contents that violate
   DMCA or local laws.
@@ -106,7 +108,9 @@ extra_gated_fields:
 [**Technical Report**](https://huggingface.co/papers/2603.08823) | [**GitHub**](https://github.com/fishaudio/fish-speech) | [**Playground**](https://fish.audio)
-**Fish Audio S2 Pro** is a leading text-to-speech (TTS) model with fine-grained inline control of prosody and emotion. Trained on over 10M+ hours of audio data across 80+ languages, the system combines reinforcement learning alignment with a dual-autoregressive architecture. The release includes model weights, fine-tuning code, and an SGLang-based streaming inference engine.
 ## Architecture
@@ -137,17 +141,20 @@ S2 Pro supports 80+ languages.
 ## Production Streaming Performance
-On a single NVIDIA H200 GPU:
 - **Real-Time Factor (RTF):** 0.195
 - **Time-to-first-audio:** ~100 ms
 - **Throughput:** 3,000+ acoustic tokens/s while maintaining RTF below 0.5
 ## Links
 - [Fish Speech GitHub](https://github.com/fishaudio/fish-speech)
 - [Fish Audio Playground](https://fish.audio)
-- [Blog & Tech Report](https://fish.audio/blog/fish-audio-open-sources-s2/)
 ## Technical Report

 - text-to-speech
 - instruction-following
 - multilingual
+- multi-speaker
+- multi-turn
 inference: false
 extra_gated_prompt: You agree to not use the model to generate contents that violate
   DMCA or local laws.
 [**Technical Report**](https://huggingface.co/papers/2603.08823) | [**GitHub**](https://github.com/fishaudio/fish-speech) | [**Playground**](https://fish.audio)
+**Fish Audio S2 Pro** is a leading text-to-speech (TTS) system featuring multi-speaker, multi-turn generation and fine-grained inline control of prosody and emotion via natural-language descriptions.
+The model was trained on over 10M hours of audio data across 80+ languages using a multi-stage training recipe. The recipe included a staged data pipeline covering video and speech captioning, voice-quality assessment, and reward modeling to enable robust reinforcement learning alignment.
 ## Architecture
 ## Production Streaming Performance
+On a single NVIDIA H200 GPU, the SGLang-based inference engine achieves:
 - **Real-Time Factor (RTF):** 0.195
 - **Time-to-first-audio:** ~100 ms
 - **Throughput:** 3,000+ acoustic tokens/s while maintaining RTF below 0.5
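
As a reading aid for the figures above, here is a minimal sketch of what an RTF of 0.195 implies. It assumes the conventional definition of real-time factor (synthesis time divided by the duration of the audio produced); the model card does not spell out its measurement method. The helper names `real_time_factor` and `synthesis_time` are hypothetical and are not part of Fish Speech or SGLang.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF under the conventional definition: compute time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time needed to produce `audio_seconds` of audio."""
    return audio_seconds * rtf


# Reported figure from the model card (single NVIDIA H200 GPU).
REPORTED_RTF = 0.195

# Under the assumed definition, 10 s of speech needs about 1.95 s of compute.
seconds_needed = synthesis_time(10.0, REPORTED_RTF)

# An RTF below 1.0 means generation runs faster than playback.
speedup_over_real_time = 1.0 / REPORTED_RTF

print(f"compute for 10 s of audio: {seconds_needed:.2f} s")
print(f"speed relative to playback: {speedup_over_real_time:.2f}x")
```

The throughput figure (acoustic tokens per second) cannot be converted to audio duration from this card alone, since that depends on the codec frame rate and codebook count, which are not stated here.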
+## Authors
+Shijia Liao, Yuxuan Wang, Songting Liu, [Yifan Cheng](https://huggingface.co/WhaleDolphin), [Ruoyi Zhang](https://huggingface.co/PoTaTo721), Tianyu Li, Shidong Li, [Yisheng Zheng](https://huggingface.co/sfzys), Xingwei Liu, [Qingzheng Wang](https://huggingface.co/qingzhengwang), Zhizhuo Zhou, Jiahua Liu, Xin Chen, and Dawei Han.
 ## Links
 - [Fish Speech GitHub](https://github.com/fishaudio/fish-speech)
 - [Fish Audio Playground](https://fish.audio)
+- [Official Blog](https://fish.audio/blog/fish-audio-open-sources-s2/)
 ## Technical Report