SreyanG-NVIDIA committed
Commit 1149626 · verified · 1 Parent(s): 1b7715c

Add vLLM usage example

Files changed (1): README.md (+52 -0)
README.md CHANGED
@@ -368,6 +368,58 @@ out = model.generate(**batch, **generate_kwargs)
 
 ## Additional Speed & Memory Improvements
 
+### vLLM Inference (5-7x faster)
+
+AF3 can now run with **vLLM** for significantly faster inference, with an **average 5-7x speedup** over standard Transformers generation.
+
+Install:
+
+```bash
+VLLM_USE_PRECOMPILED=1 uv pip install -U --pre \
+  --override <(printf 'transformers>=5.0.0rc1\n') \
+  "vllm[audio] @ git+https://github.com/vllm-project/vllm.git"
+```
+
+Inference:
+
+```python
+import os
+from pathlib import Path
+
+from vllm import LLM, SamplingParams
+
+os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"
+
+# audio_url = Path("./audio_file.mp3").expanduser().resolve().as_uri()  # local file -> file://...
+audio_url = "https://huggingface.co/datasets/nvidia/AudioSkills/resolve/main/assets/WhDJDIviAOg_120_10.mp3"  # web URL -> https://...
+
+prompt = "Transcribe the input speech."
+
+llm = LLM(
+    model="nvidia/audio-flamingo-3-hf",
+    allowed_local_media_path=str(Path.cwd()),
+    max_model_len=20000,
+)
+sp = SamplingParams(max_tokens=4096, temperature=0.0, repetition_penalty=1.2)
+
+print(
+    llm.chat(
+        [
+            {
+                "role": "user",
+                "content": [
+                    {"type": "text", "text": prompt},
+                    {"type": "audio_url", "audio_url": {"url": audio_url}},
+                ],
+            }
+        ],
+        sp,
+    )[0]
+    .outputs[0]
+    .text
+)
+```
+
 ### Flash Attention 2
 
 If your GPU supports it and you are **not** using `torch.compile`, install Flash-Attention and enable it at load time:
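The commented-out line in the inference example above builds a `file://` URI for a local audio file, which vLLM's `audio_url` content type can fetch as long as `allowed_local_media_path` covers the file's directory. A minimal sketch of that conversion, using a hypothetical `to_file_uri` helper and file name (not part of the diff):

```python
from pathlib import Path


def to_file_uri(path_str: str) -> str:
    """Turn a (possibly relative) local path into an absolute file:// URI."""
    # expanduser() handles "~", resolve() makes the path absolute,
    # and as_uri() produces the file:// form vLLM expects in audio_url.
    return Path(path_str).expanduser().resolve().as_uri()


# Hypothetical local file; the path need not exist to build the URI.
audio_url = to_file_uri("./audio_file.mp3")
print(audio_url)  # e.g. file:///home/user/audio_file.mp3
```

The resulting string can be dropped into the `{"type": "audio_url", ...}` content entry in place of the web URL.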