Link paper and add author information

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +10 -3
README.md CHANGED
@@ -91,6 +91,8 @@ tags:
  - text-to-speech
  - instruction-following
  - multilingual
+ - multi-speaker
+ - multi-turn
  inference: false
  extra_gated_prompt: You agree to not use the model to generate contents that violate
  DMCA or local laws.
@@ -106,7 +108,9 @@ extra_gated_fields:

  [**Technical Report**](https://huggingface.co/papers/2603.08823) | [**GitHub**](https://github.com/fishaudio/fish-speech) | [**Playground**](https://fish.audio)

- **Fish Audio S2 Pro** is a leading text-to-speech (TTS) model with fine-grained inline control of prosody and emotion. Trained on over 10M+ hours of audio data across 80+ languages, the system combines reinforcement learning alignment with a dual-autoregressive architecture. The release includes model weights, fine-tuning code, and an SGLang-based streaming inference engine.
+ **Fish Audio S2 Pro** is a leading text-to-speech (TTS) system featuring multi-speaker, multi-turn generation and fine-grained inline control of prosody and emotion via natural-language descriptions.
+
+ The model was trained on over 10M+ hours of audio data across 80+ languages using a multi-stage training recipe. This included a staged data pipeline covering video and speech captioning, voice-quality assessment, and reward modeling to enable robust reinforcement learning alignment.

  ## Architecture

@@ -137,17 +141,20 @@ S2 Pro supports 80+ languages.

  ## Production Streaming Performance

- On a single NVIDIA H200 GPU:
+ On a single NVIDIA H200 GPU, the SGLang-based inference engine achieves:

  - **Real-Time Factor (RTF):** 0.195
  - **Time-to-first-audio:** ~100 ms
  - **Throughput:** 3,000+ acoustic tokens/s while maintaining RTF below 0.5

+ ## Authors
+ Shijia Liao, Yuxuan Wang, Songting Liu, [Yifan Cheng](https://huggingface.co/WhaleDolphin), [Ruoyi Zhang](https://huggingface.co/PoTaTo721), Tianyu Li, Shidong Li, [Yisheng Zheng](https://huggingface.co/sfzys), Xingwei Liu, [Qingzheng Wang](https://huggingface.co/qingzhengwang), Zhizhuo Zhou, Jiahua Liu, Xin Chen, and Dawei Han.
+
  ## Links

  - [Fish Speech GitHub](https://github.com/fishaudio/fish-speech)
  - [Fish Audio Playground](https://fish.audio)
- - [Blog & Tech Report](https://fish.audio/blog/fish-audio-open-sources-s2/)
+ - [Official Blog](https://fish.audio/blog/fish-audio-open-sources-s2/)

  ## Technical Report
