Link paper and add author information
#1
by nielsr HF Staff - opened
README.md
CHANGED
@@ -91,6 +91,8 @@ tags:
 - text-to-speech
 - instruction-following
 - multilingual
+- multi-speaker
+- multi-turn
 inference: false
 extra_gated_prompt: You agree to not use the model to generate contents that violate
 DMCA or local laws.
@@ -106,7 +108,9 @@ extra_gated_fields:
 
 [**Technical Report**](https://huggingface.co/papers/2603.08823) | [**GitHub**](https://github.com/fishaudio/fish-speech) | [**Playground**](https://fish.audio)
 
-**Fish Audio S2 Pro** is a leading text-to-speech (TTS)
+**Fish Audio S2 Pro** is a leading text-to-speech (TTS) system featuring multi-speaker, multi-turn generation and fine-grained inline control of prosody and emotion via natural-language descriptions.
+
+The model was trained on over 10M+ hours of audio data across 80+ languages using a multi-stage training recipe. This included a staged data pipeline covering video and speech captioning, voice-quality assessment, and reward modeling to enable robust reinforcement learning alignment.
 
 ## Architecture
 
@@ -137,17 +141,20 @@ S2 Pro supports 80+ languages.
 
 ## Production Streaming Performance
 
-On a single NVIDIA H200 GPU:
+On a single NVIDIA H200 GPU, the SGLang-based inference engine achieves:
 
 - **Real-Time Factor (RTF):** 0.195
 - **Time-to-first-audio:** ~100 ms
 - **Throughput:** 3,000+ acoustic tokens/s while maintaining RTF below 0.5
 
+## Authors
+Shijia Liao, Yuxuan Wang, Songting Liu, [Yifan Cheng](https://huggingface.co/WhaleDolphin), [Ruoyi Zhang](https://huggingface.co/PoTaTo721), Tianyu Li, Shidong Li, [Yisheng Zheng](https://huggingface.co/sfzys), Xingwei Liu, [Qingzheng Wang](https://huggingface.co/qingzhengwang), Zhizhuo Zhou, Jiahua Liu, Xin Chen, and Dawei Han.
+
 ## Links
 
 - [Fish Speech GitHub](https://github.com/fishaudio/fish-speech)
 - [Fish Audio Playground](https://fish.audio)
-- [
+- [Official Blog](https://fish.audio/blog/fish-audio-open-sources-s2/)
 
 ## Technical Report
 
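For reviewers reading the Production Streaming Performance lines in this diff: the README does not say how the Real-Time Factor was measured, so the sketch below assumes the common definition (wall-clock generation time divided by the duration of the audio produced). The function name and the sample timings are hypothetical and are not taken from the Fish Audio codebase.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return RTF, assumed here to be generation time / generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical measurement: 10 s of audio synthesized in 1.95 s of wall-clock time.
rtf = real_time_factor(1.95, 10.0)

# Under this definition, an RTF of 0.195 means one minute of audio
# takes about 0.195 * 60 seconds to generate.
seconds_per_audio_minute = rtf * 60

print(f"RTF = {rtf:.3f}")
print(f"Seconds to generate one minute of audio: {seconds_per_audio_minute:.1f}")
```

Under this assumed definition, any value below 1.0 means synthesis runs faster than playback.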