Slow inference on RTX 3090

#16
by Blakus - opened

Hello, S2-pro is slow on my 3090, even when using the --compile flag:
I get between 4 and 5 it/s. Is this the expected speed?

2026-03-17 02:31:40.105 | INFO     | fish_speech.models.text2semantic.inference:generate_long:653 - Encoded prompt shape: torch.Size([11, 669])
  1%|█                                                                           | 474/32098 [01:28<1:38:56,  5.33it/s]
2026-03-17 02:33:10.021 | INFO     | fish_speech.models.text2semantic.inference:generate_long:682 - Compilation time: 89.96 seconds
2026-03-17 02:33:10.022 | INFO     | fish_speech.models.text2semantic.inference:generate_long:690 - Batch 0: Generated 476 tokens in 89.96 seconds, 5.29 tokens/sec
2026-03-17 02:33:10.022 | INFO     | fish_speech.models.text2semantic.inference:generate_long:694 - Bandwidth achieved: 24.14 GB/s
2026-03-17 02:33:10.023 | INFO     | fish_speech.models.text2semantic.inference:generate_long:720 - GPU Memory used: 22.15 GB

=== Generation Complete! ===

Maybe I should try using Sage Attention or Flash Attention 2?
I'm on Windows, btw.

Thanks in advance.

It does have FlashAttention built in.

Yeah, I believe the minimum requirement is a 5090 with at least 24 GB of VRAM.

I had similar performance on my 3090; something on the order of 5x slower than real time, if I recall.

Re: GSherman's comment: the 3090 does have 24 GB of VRAM. (And the 5090 has 32 GB, so I'm not sure what "a 5090 with at least 24 GB of VRAM" refers to, other than a 5090.)
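For what it's worth, a quick back-of-the-envelope from the log numbers above suggests raw VRAM bandwidth isn't the bottleneck here. Autoregressive decode is typically memory-bandwidth-bound (roughly one full read of the weights per token), and the run only reaches a small fraction of the 3090's spec-sheet bandwidth. The 936.2 GB/s peak figure below is the 3090's theoretical GDDR6X bandwidth from NVIDIA's spec sheet, not a measured value:

```python
# Back-of-the-envelope check using the numbers reported in the log.
tokens_per_sec = 5.29       # "5.29 tokens/sec" from the log
achieved_bw_gbs = 24.14     # "Bandwidth achieved: 24.14 GB/s" from the log
peak_bw_3090_gbs = 936.2    # RTX 3090 theoretical peak memory bandwidth (spec sheet)

# Bytes moved per generated token: consistent with reading a multi-GB
# model once per decode step.
gb_per_token = achieved_bw_gbs / tokens_per_sec

# Fraction of theoretical peak bandwidth actually achieved.
utilization = achieved_bw_gbs / peak_bw_3090_gbs

print(f"~{gb_per_token:.2f} GB read per token")   # ~4.56 GB
print(f"~{utilization:.1%} of peak bandwidth")    # ~2.6%
```

At ~2-3% of peak bandwidth, the slowdown looks more like a kernel/software issue (e.g. Windows overheads or an unoptimized attention path) than the card simply being too small.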

I'm not sure, bro.

I get the same speed on my 3090
