Add vLLM usage example
README.md
CHANGED
@@ -368,6 +368,58 @@ out = model.generate(**batch, **generate_kwargs)
## Additional Speed & Memory Improvements
### vLLM Inference (5-7x faster)

AF3 can now run with **vLLM** for significantly faster inference, **on average a 5-7x speedup** over standard Transformers generation.

Install:

```bash
VLLM_USE_PRECOMPILED=1 uv pip install -U --pre \
  --override <(printf 'transformers>=5.0.0rc1\n') \
  "vllm[audio] @ git+https://github.com/vllm-project/vllm.git"
```

Inference:

```python
import os
from pathlib import Path

from vllm import LLM, SamplingParams

os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"

# audio_url = Path("./audio_file.mp3").expanduser().resolve().as_uri()  # local file -> file://...
audio_url = "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"  # web URL -> https://...

prompt = "Transcribe the input speech."

llm = LLM(
    model="nvidia/audio-flamingo-3-hf",
    allowed_local_media_path=str(Path.cwd()),
    max_model_len=20000,
)
sp = SamplingParams(max_tokens=4096, temperature=0.0, repetition_penalty=1.2)

print(
    llm.chat(
        [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                ],
            }
        ],
        sp,
    )[0]
    .outputs[0]
    .text
)
```
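Much of vLLM's throughput advantage comes from scheduling many requests concurrently: `LLM.chat` also accepts a list of conversations, so several audio files can be transcribed in one call. A minimal sketch reusing the `llm` and `sp` objects from the snippet above, with two hypothetical local files (`a.mp3`, `b.mp3`) standing in for real audio:

```python
from pathlib import Path

# Hypothetical local files under the current directory (covered by
# allowed_local_media_path above); https:// URLs work the same way.
audio_urls = [
    Path("./a.mp3").expanduser().resolve().as_uri(),
    Path("./b.mp3").expanduser().resolve().as_uri(),
]

# One conversation per audio file.
conversations = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe the input speech."},
                {"type": "audio_url", "audio_url": {"url": url}},
            ],
        }
    ]
    for url in audio_urls
]

# llm and sp are the LLM / SamplingParams objects created above;
# vLLM batches the requests and returns one output per conversation.
for out in llm.chat(conversations, sp):
    print(out.outputs[0].text)
```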
### Flash Attention 2
If your GPU supports it and you are **not** using `torch.compile`, install Flash-Attention and enable it at load time:
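A minimal sketch of what that load-time switch typically looks like in Transformers, with `AutoModel` as a placeholder for the AF3 model class used earlier in this README:

```python
# Prerequisite (shell): pip install flash-attn --no-build-isolation
import torch
from transformers import AutoModel  # placeholder: substitute the AF3 model class used above

model = AutoModel.from_pretrained(
    "nvidia/audio-flamingo-3-hf",
    dtype=torch.bfloat16,  # FlashAttention requires fp16/bf16 ("torch_dtype" on older Transformers)
    attn_implementation="flash_attention_2",
).to("cuda")
```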