Slow inference on RTX 3090

#16
by Blakus - opened

Hello, S2-pro is slow on my 3090, even when using the --compile flag:
I get between 4 and 5 it/s. Is this the expected speed?

2026-03-17 02:31:40.105 | INFO     | fish_speech.models.text2semantic.inference:generate_long:653 - Encoded prompt shape: torch.Size([11, 669])
  1%|█                                                                           | 474/32098 [01:28<1:38:56,  5.33it/s]
2026-03-17 02:33:10.021 | INFO     | fish_speech.models.text2semantic.inference:generate_long:682 - Compilation time: 89.96 seconds
2026-03-17 02:33:10.022 | INFO     | fish_speech.models.text2semantic.inference:generate_long:690 - Batch 0: Generated 476 tokens in 89.96 seconds, 5.29 tokens/sec
2026-03-17 02:33:10.022 | INFO     | fish_speech.models.text2semantic.inference:generate_long:694 - Bandwidth achieved: 24.14 GB/s
2026-03-17 02:33:10.023 | INFO     | fish_speech.models.text2semantic.inference:generate_long:720 - GPU Memory used: 22.15 GB

=== Generation Complete! ===

Maybe I should try using Sage Attention or Flash Attention 2?
I'm on Windows, btw.

Thanks in advance.

It does have FlashAttention built in.

Yeah, I believe the minimum requirement is a 5090 with at least 24 GB of VRAM.

I had similar performance on my 3090; something on the order of 5x slower than real time, if I recall.

Re: GSherman's comment: the 3090 does have 24 GB of VRAM. (And the 5090 has 32 GB, so I'm not sure what "a 5090 with at least 24 GB of VRAM" refers to, other than a 5090.)
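For what it's worth, a quick back-of-the-envelope from the log numbers above suggests raw VRAM bandwidth isn't the bottleneck here. Autoregressive decode is typically memory-bandwidth-bound (roughly one full read of the weights per token), and the run only reaches a small fraction of the 3090's spec-sheet bandwidth. The 936.2 GB/s peak figure below is the 3090's theoretical GDDR6X bandwidth from NVIDIA's spec sheet, not a measured value:

```python
# Back-of-the-envelope check using the numbers reported in the log.
tokens_per_sec = 5.29       # "5.29 tokens/sec" from the log
achieved_bw_gbs = 24.14     # "Bandwidth achieved: 24.14 GB/s" from the log
peak_bw_3090_gbs = 936.2    # RTX 3090 theoretical peak memory bandwidth (spec sheet)

# Bytes moved per generated token: consistent with reading a multi-GB
# model once per decode step.
gb_per_token = achieved_bw_gbs / tokens_per_sec

# Fraction of theoretical peak bandwidth actually achieved.
utilization = achieved_bw_gbs / peak_bw_3090_gbs

print(f"~{gb_per_token:.2f} GB read per token")   # ~4.56 GB
print(f"~{utilization:.1%} of peak bandwidth")    # ~2.6%
```

At ~2-3% of peak bandwidth, the slowdown looks more like a kernel/software issue (e.g. Windows overheads or an unoptimized attention path) than the card simply being too small.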

I'm not sure, bro.

I get the same speed on my 3090
