How to use devnagriai/snorTTS-Indic-v0-AWQ-W4A16 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-to-speech", model="devnagriai/snorTTS-Indic-v0-AWQ-W4A16")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("devnagriai/snorTTS-Indic-v0-AWQ-W4A16")
model = AutoModelForCausalLM.from_pretrained("devnagriai/snorTTS-Indic-v0-AWQ-W4A16")

This is a quantized version of snorbyte/snorTTS-Indic-v0 using AWQ (Activation-aware Weight Quantization) with W4A16 precision.

| Parameter | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Weight Precision | 4-bit |
| Activation Precision | 16-bit |
| Format | compressed-tensors |
| Quantization Tool | llmcompressor |
| Model Size Reduction | ~75% (theoretical, for weights going from 16-bit to 4-bit) |
| Calibration Samples | 512 |
| Calibration Dataset | snorbyte/indic-tts-sample-snac-encoded |
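
To make the "4-bit weights, 16-bit activations" rows more concrete, here is a minimal sketch of group-wise 4-bit weight quantization in plain Python. It uses simple round-to-nearest with a per-group scale and minimum, which is an assumption made for illustration: it is not the AWQ algorithm itself (AWQ additionally uses calibration activations to rescale salient weight channels before quantizing), and it is not how llmcompressor or compressed-tensors lay out data internally. The function names and sample weights are hypothetical.

```python
# Illustrative round-to-nearest 4-bit group quantization.
# This is NOT AWQ; it only shows the storage idea behind "W4": each group of
# weights is stored as small integers plus one scale and one offset.

def quantize_group(weights, bits=4):
    """Map a group of float weights to unsigned integers in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1                      # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0       # fall back to 1.0 for a constant group
    q = [round((w - w_min) / scale) for w in weights]
    return q, scale, w_min

def dequantize_group(q, scale, w_min):
    """Recover approximate float weights; compute would then run in 16-bit."""
    return [v * scale + w_min for v in q]

# Hypothetical group of eight weights.
weights = [0.12, -0.53, 0.98, -0.07, 0.41, -0.88, 0.33, 0.05]
q, scale, w_min = quantize_group(weights)
restored = dequantize_group(q, scale, w_min)

# Worst-case reconstruction error is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

In a weight-only scheme like this, the integers are expanded back to a 16-bit type before the matrix multiply, so the saving is in storage and memory traffic rather than in the arithmetic itself.
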

| Metric | Original Model | This Model (AWQ) |
|---|---|---|
| Model Size | ~8GB | ~3.5GB (~56% reduction) |
| Inference Speed | Baseline | Faster (4-bit weights reduce memory traffic; compute stays 16-bit) |
| Memory Usage | High | Low |
| Audio Quality | Reference | Minimal degradation |
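
The size figures in this card can be checked with simple arithmetic. The sketch below relates the per-weight saving of going from 16-bit to 4-bit to the approximate on-disk sizes quoted in the table (~8GB and ~3.5GB). The helper name is hypothetical, and the inputs are the card's rounded figures, so the result is only approximate.

```python
def reduction_pct(original, quantized):
    """Percentage reduction when going from `original` to `quantized`."""
    return 100.0 * (1.0 - quantized / original)

# Per-weight saving for quantized layers: 16 bits down to 4 bits.
theoretical = reduction_pct(16, 4)

# Whole-checkpoint saving, using the approximate sizes from the table (in GB).
measured = reduction_pct(8.0, 3.5)

print(f"theoretical (quantized weights only): {theoretical:.2f}%")
print(f"measured (whole checkpoint):          {measured:.2f}%")
```

A gap between the two numbers is expected: a quantized checkpoint also stores per-group scales and offsets, and some tensors (for example embeddings or the output head) may be left in 16-bit. Which tensors were skipped for this model is not stated here, so treat that explanation as a likely cause rather than a confirmed one.
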

docker run \
--runtime nvidia \
--gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v ~/.cache/vllm:/root/.cache/vllm \
-v ~/snor-quant:/models \
-p 8002:8002 \
--env "HF_HUB_ENABLE_HF_TRANSFER=1" \
--env "HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}" \
--env "HF_HUB_OFFLINE=1" \
--ipc=host \
--shm-size 32g \
--log-opt max-size=10m \
--log-opt max-file=3 \
vllm/vllm-openai:latest \
--port 8002 \
--model "/models/snorTTS-Indic-v0-AWQ-W4A16" \
--served-model-name llm \
--host 0.0.0.0 \
--max-model-len 2048 \
--max-num-seqs 5 \
--gpu-memory-utilization 0.20 \
--dtype auto \
--quantization compressed-tensors \
--trust-remote-code \
--uvicorn-log-level info
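
Once the container is up, the server can be queried over vLLM's OpenAI-compatible HTTP API. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable at `localhost:8002` (from `-p 8002:8002`) and that the model is addressed as `llm` (from `--served-model-name`). The prompt format this model expects is not described in this card, so the prompt is left as a caller-supplied value, and the sampling defaults are placeholders rather than recommended settings. The returned text is presumably a sequence of SNAC audio tokens that still has to be decoded into a waveform; that step is not shown.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8002"  # assumption: matches -p 8002:8002 and --port 8002
MODEL_NAME = "llm"                  # matches --served-model-name llm

def build_completion_request(prompt, max_tokens=1024, temperature=0.7):
    """Build a POST request for the /v1/completions endpoint.

    max_tokens and temperature are illustrative defaults; prompt length plus
    max_tokens must fit within --max-model-len 2048.
    """
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    req = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]

# Example call (requires the server to be running):
# text = complete("<prompt in the format the base model expects>")
```

Splitting request construction from the network call keeps the payload easy to inspect or log before anything is sent.
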