NanoCodec Decoder - ONNX
ONNX-optimized decoder for the NeMo NanoCodec audio codec.
This model provides about 2.6x faster GPU inference than the PyTorch version (see Performance) for KaniTTS and similar TTS systems.
Model Details
- Model Type: Audio Codec Decoder
- Format: ONNX (Opset 14)
- Input: Token indices [batch, 4, num_frames]
- Output: Audio waveform [batch, samples] @ 22050 Hz
- Size: 122 MB
- Parameters: ~31.5M (decoder only, 15.8% of full model)
Performance
| Configuration | Decode Time/Frame | Speedup |
|---|---|---|
| PyTorch + GPU | ~92 ms | Baseline |
| ONNX + GPU | ~35 ms | 2.6x faster ✨ |
| ONNX + CPU | ~60-80 ms | ~1.2-1.5x faster |
Real-Time Factor (RTF): 0.44 with ONNX on GPU (~35 ms to decode each 80 ms frame, so audio is generated faster than playback)
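The RTF figure can be reproduced from the numbers in this README. The sketch below assumes RTF is defined as decode time divided by the duration of audio produced, and that one frame corresponds to 1764 samples at 22050 Hz; the function name is illustrative.

```python
SAMPLE_RATE_HZ = 22050
SAMPLES_PER_FRAME = 1764  # upsampling factor stated in this README


def real_time_factor(decode_ms_per_frame: float) -> float:
    """RTF = time spent decoding / duration of the audio produced.

    Values below 1.0 mean decoding is faster than playback.
    """
    frame_duration_ms = 1000.0 * SAMPLES_PER_FRAME / SAMPLE_RATE_HZ  # 80 ms
    return decode_ms_per_frame / frame_duration_ms


# ONNX + GPU row of the table above: ~35 ms per frame
print(round(real_time_factor(35.0), 2))  # 0.44
```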
Quick Start
Installation
pip install onnxruntime-gpu numpy
For CPU-only:
pip install onnxruntime numpy
Usage
import numpy as np
import onnxruntime as ort
# Load model
session = ort.InferenceSession(
    "nano_codec_decoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)
# Prepare input
tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64) # [batch, codebooks, frames]
tokens_len = np.array([10], dtype=np.int64)
# Run inference
outputs = session.run(
    None,
    {"tokens": tokens, "tokens_len": tokens_len}
)
audio, audio_len = outputs
print(f"Generated audio: {audio.shape}") # [1, 17640] samples
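To listen to the result, the float waveform can be written to a WAV file. This is a minimal sketch using only the standard-library `wave` module and NumPy; `save_wav` is a hypothetical helper, not part of this repository, and it assumes a single mono waveform in [-1.0, 1.0].

```python
import wave

import numpy as np


def save_wav(path, audio, num_samples, sample_rate=22050):
    """Write one mono float waveform in [-1, 1] as 16-bit PCM WAV."""
    mono = np.asarray(audio, dtype=np.float32)[:num_samples]
    # Clip before scaling so out-of-range values cannot wrap around in int16.
    pcm = (np.clip(mono, -1.0, 1.0) * 32767.0).astype("<i2")
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())


# With the outputs of session.run (first batch item):
# save_wav("out.wav", audio[0], int(audio_len[0]))
```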
Integration with KaniTTS
from onnx_decoder_optimized import ONNXKaniTTSDecoderOptimized
# Initialize decoder
decoder = ONNXKaniTTSDecoderOptimized(
    onnx_model_path="nano_codec_decoder.onnx",
    device="cuda"
)
# Decode frame (4 codec tokens)
codes = [100, 200, 300, 400]
audio = decoder.decode_frame(codes) # Returns int16 numpy array
Model Architecture
The decoder consists of two stages:
Dequantization (FSQ): Converts token indices to latent representation
- Input: [batch, 4, frames] → Output: [batch, 16, frames]
Audio Decoder (HiFiGAN): Generates audio from latents
- Input: [batch, 16, frames] → Output: [batch, samples]
- Upsampling factor: ~1764x (80 ms per frame at 22050 Hz)
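The upsampling factor follows from the sample rate and the 12.5 fps frame rate in the source model's name. A small sketch of that arithmetic, with illustrative helper names, which also predicts the output length for a given number of frames:

```python
SAMPLE_RATE_HZ = 22050
FRAME_RATE_HZ = 12.5  # taken from the source model name (12.5fps)


def samples_per_frame() -> int:
    """Audio samples generated per codec frame (the upsampling factor)."""
    return round(SAMPLE_RATE_HZ / FRAME_RATE_HZ)


def expected_num_samples(num_frames: int) -> int:
    """Expected waveform length for `num_frames` codec frames."""
    return num_frames * samples_per_frame()


print(samples_per_frame())        # 1764
print(expected_num_samples(10))   # 17640
```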
Export Details
- Source Model: nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps
- Export Method: PyTorch → ONNX (legacy exporter)
- Opset Version: 14
- Dynamic Axes: Frame dimension and audio samples
- Optimizations: Graph optimization enabled, constant folding
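The export script itself is not included here. As a rough sketch, the settings above map onto keyword arguments of `torch.onnx.export` roughly as follows. The axis labels (`num_frames`, `num_samples`) and the `decoder_module` placeholder are assumptions, and the actual export may have differed.

```python
# Keyword arguments mirroring the export details listed above.
EXPORT_KWARGS = {
    "opset_version": 14,
    "do_constant_folding": True,
    "input_names": ["tokens", "tokens_len"],
    "output_names": ["audio", "audio_len"],
    # Only the frame and sample dimensions are marked dynamic,
    # matching the "Dynamic Axes" entry; the labels are arbitrary.
    "dynamic_axes": {
        "tokens": {2: "num_frames"},
        "audio": {1: "num_samples"},
    },
}

# Hypothetical invocation (requires PyTorch and the NeMo decoder wrapped
# as a torch.nn.Module; `decoder_module` is a placeholder):
# torch.onnx.export(decoder_module, (tokens, tokens_len),
#                   "nano_codec_decoder.onnx", **EXPORT_KWARGS)
```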
Use Cases
- Text-to-Speech Systems: Fast neural codec decoding
- Real-time Audio Generation: Faster-than-real-time decoding on GPU (RTF below 1)
- Streaming TTS: Low-latency frame-by-frame decoding
- KaniTTS Integration: Drop-in replacement for PyTorch decoder
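For the streaming case, frame-by-frame decoding can be sketched as a generator that yields one audio chunk per frame. `stream_decode` is a hypothetical helper; it takes any callable with the same shape as `decode_frame` from the KaniTTS integration example, and it does not address whether independently decoded frames join without audible seams.

```python
from typing import Callable, Iterable, Iterator, Sequence

import numpy as np


def stream_decode(
    frames: Iterable[Sequence[int]],
    decode_frame: Callable[[Sequence[int]], np.ndarray],
) -> Iterator[np.ndarray]:
    """Yield one audio chunk per codec frame as soon as it is decoded."""
    for codes in frames:
        if len(codes) != 4:
            raise ValueError(f"expected 4 codec tokens per frame, got {len(codes)}")
        yield decode_frame(codes)


# Usage sketch (`decoder` as in the KaniTTS example, `play` is a placeholder):
# for chunk in stream_decode(token_frames, decoder.decode_frame):
#     play(chunk)
```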
Requirements
GPU (Recommended)
- CUDA 11.8+ or 12.x
- cuDNN 8.x or 9.x
- ONNX Runtime GPU:
pip install onnxruntime-gpu
CPU
- Any modern CPU
- ONNX Runtime:
pip install onnxruntime
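To check which backend will actually be used, the available execution providers can be inspected before creating a session. A small sketch: `pick_providers` is an illustrative helper, while `ort.get_available_providers()` is the ONNX Runtime call that reports what the installed package supports.

```python
def pick_providers(available):
    """Return the preferred providers that are actually installed, GPU first."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"no supported execution provider in {available}")
    return chosen


# Usage sketch (requires onnxruntime or onnxruntime-gpu):
# import onnxruntime as ort
# providers = pick_providers(ort.get_available_providers())
# session = ort.InferenceSession("nano_codec_decoder.onnx", providers=providers)
```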
Inputs
tokens (int64): Codec token indices
- Shape: [batch_size, 4, num_frames]
- Range: [0, 499] (FSQ codebook indices)
tokens_len (int64): Number of frames
- Shape: [batch_size]
- Value: Number of frames in the sequence
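Checking inputs against this specification before calling the session can turn opaque runtime errors into clear messages. A minimal sketch with NumPy; `validate_inputs` is a hypothetical helper, and the check that `tokens_len` does not exceed the frame dimension is an assumption about how lengths are used.

```python
import numpy as np


def validate_inputs(tokens, tokens_len, num_codebooks=4, max_index=499):
    """Raise if the arrays do not match the documented input specification."""
    if tokens.dtype != np.int64 or tokens_len.dtype != np.int64:
        raise TypeError("tokens and tokens_len must both be int64")
    if tokens.ndim != 3 or tokens.shape[1] != num_codebooks:
        raise ValueError(f"tokens must be [batch, {num_codebooks}, frames], got {tokens.shape}")
    if tokens_len.shape != (tokens.shape[0],):
        raise ValueError(f"tokens_len must be [batch], got {tokens_len.shape}")
    if tokens.size and (tokens.min() < 0 or tokens.max() > max_index):
        raise ValueError(f"token indices must lie in [0, {max_index}]")
    # Assumption: a sequence length cannot exceed the padded frame dimension.
    if np.any(tokens_len < 0) or np.any(tokens_len > tokens.shape[2]):
        raise ValueError("tokens_len must lie in [0, num_frames]")
```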
Outputs
audio (float32): Generated audio waveform
- Shape: [batch_size, num_samples]
- Range: [-1.0, 1.0]
- Sample rate: 22050 Hz
audio_len (int64): Audio length
- Shape: [batch_size]
- Value: Number of audio samples
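When a batch contains sequences of different lengths, `audio_len` can be used to trim each waveform. The sketch below assumes shorter items are padded at the end up to `num_samples`; `unpad_batch` is an illustrative name.

```python
import numpy as np


def unpad_batch(audio, audio_len):
    """Split a [batch, num_samples] array into per-item waveforms of valid length."""
    return [audio[i, : int(n)] for i, n in enumerate(audio_len)]


# Usage sketch with the session outputs:
# waveforms = unpad_batch(audio, audio_len)
```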
Accuracy
Compared to the PyTorch reference implementation:
- Mean Absolute Error: 0.0087
- Correlation: 1.000000 (rounded to six decimal places)
- Relative Error: 0.0006%
Audio quality is virtually identical to the PyTorch version.
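Metrics of this kind can be computed with a few lines of NumPy. This sketch covers mean absolute error and Pearson correlation between two waveforms. The definition of "relative error" used above is not stated, so it is left out. `compare_waveforms` is a hypothetical helper and not necessarily how the reported numbers were produced.

```python
import numpy as np


def compare_waveforms(reference, candidate):
    """Return MAE and Pearson correlation between two equal-length waveforms."""
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")
    mae = float(np.mean(np.abs(ref - cand)))
    correlation = float(np.corrcoef(ref, cand)[0, 1])
    return {"mae": mae, "correlation": correlation}


# Usage sketch: `pytorch_audio` and `onnx_audio` are placeholders for the
# two decoders' outputs on the same tokens.
# print(compare_waveforms(pytorch_audio, onnx_audio))
```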
Limitations
- Fixed sample rate (22050 Hz)
- Single-channel (mono) audio only
- Requires valid FSQ token indices (0-499 range)
- Best performance on NVIDIA GPUs with CUDA support
License
Apache 2.0 (same as source model)
Links
- Original Model: nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps
- KaniTTS: nineninesix/kani-tts-400m-en
- ONNX Runtime: onnxruntime.ai
Acknowledgments
- NVIDIA NeMo team for the original NanoCodec
- ONNX Runtime team for the inference engine
- KaniTTS team for the TTS system