Add vLLM usage example
README.md
CHANGED
@@ -368,6 +368,58 @@ out = model.generate(**batch, **generate_kwargs)
## Additional Speed & Memory Improvements
### vLLM Inference (5-7x faster)

AF3 can now run with **vLLM** for significantly faster inference, **on average a 5-7x speedup** over standard Transformers generation.

Install:

```bash
VLLM_USE_PRECOMPILED=1 uv pip install -U --pre \
  --override <(printf 'transformers>=5.0.0rc1\n') \
  "vllm[audio] @ git+https://github.com/vllm-project/vllm.git"
```

Inference:

```python
import os
from pathlib import Path

from vllm import LLM, SamplingParams

os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"

# audio_url = Path("./audio_file.mp3").expanduser().resolve().as_uri()  # local file -> file://...
audio_url = "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"  # web URL -> https://...

prompt = "Transcribe the input speech."

llm = LLM(
    model="nvidia/audio-flamingo-3-hf",
    allowed_local_media_path=str(Path.cwd()),
    max_model_len=20000,
)
sp = SamplingParams(max_tokens=4096, temperature=0.0, repetition_penalty=1.2)

print(
    llm.chat(
        [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "audio_url", "audio_url": {"url": audio_url}},
                ],
            }
        ],
        sp,
    )[0]
    .outputs[0]
    .text
)
```
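Much of vLLM's throughput advantage comes from scheduling many requests concurrently: `LLM.chat` also accepts a list of conversations, so several audio files can be transcribed in one call. A minimal sketch reusing the `llm` and `sp` objects from the snippet above, with two hypothetical local files (`a.mp3`, `b.mp3`) standing in for real audio:

```python
from pathlib import Path

# Hypothetical local files under the current directory (covered by
# allowed_local_media_path above); https:// URLs work the same way.
audio_urls = [
    Path("./a.mp3").expanduser().resolve().as_uri(),
    Path("./b.mp3").expanduser().resolve().as_uri(),
]

# One conversation per audio file.
conversations = [
    [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe the input speech."},
                {"type": "audio_url", "audio_url": {"url": url}},
            ],
        }
    ]
    for url in audio_urls
]

# llm and sp are the LLM / SamplingParams objects created above;
# vLLM batches the requests and returns one output per conversation.
for out in llm.chat(conversations, sp):
    print(out.outputs[0].text)
```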
### Flash Attention 2
If your GPU supports it and you are **not** using `torch.compile`, install Flash-Attention and enable it at load time:
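A minimal sketch of what that load-time switch typically looks like in Transformers, with `AutoModel` as a placeholder for the AF3 model class used earlier in this README:

```python
# Prerequisite (shell): pip install flash-attn --no-build-isolation
import torch
from transformers import AutoModel  # placeholder: substitute the AF3 model class used above

model = AutoModel.from_pretrained(
    "nvidia/audio-flamingo-3-hf",
    dtype=torch.bfloat16,  # FlashAttention requires fp16/bf16 ("torch_dtype" on older Transformers)
    attn_implementation="flash_attention_2",
).to("cuda")
```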