NanoCodec Decoder - ONNX

ONNX-optimized decoder for the NeMo NanoCodec audio codec.

This model decodes roughly 2.6x faster than the PyTorch version on GPU (see Performance), which makes it a good fit for KaniTTS and similar TTS systems.

Model Details

Model Type: Audio Codec Decoder
Format: ONNX (Opset 14)
Input: Token indices [batch, 4, num_frames]
Output: Audio waveform [batch, samples] @ 22050 Hz
Size: 122 MB
Parameters: ~31.5M (decoder only, 15.8% of full model)

Performance

Configuration	Decode Time/Frame	Speedup
PyTorch + GPU	~92 ms	Baseline
ONNX + GPU	~35 ms	2.6x faster ✨
ONNX + CPU	~60-80 ms	~1.2-1.5x faster

Real-Time Factor (RTF): ~0.44 on GPU (about 35 ms of compute per 80 ms frame of audio, so decoding runs faster than playback)
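
The speedup and RTF figures follow directly from the per-frame timings in the table. A minimal sketch of that arithmetic, assuming each frame covers 80 ms of audio (12.5 frames per second, per the source model name); the function names are illustrative and not part of any library:

```python
FRAME_DURATION_MS = 80.0  # 1000 ms / 12.5 frames per second


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def real_time_factor(decode_ms_per_frame: float,
                     frame_duration_ms: float = FRAME_DURATION_MS) -> float:
    """Compute time divided by audio duration; below 1.0 means faster than playback."""
    return decode_ms_per_frame / frame_duration_ms


print(f"ONNX + GPU speedup: {speedup(92, 35):.1f}x")   # 2.6x
print(f"ONNX + GPU RTF: {real_time_factor(35):.2f}")   # 0.44
```

By the same arithmetic, the CPU timings of 60-80 ms per frame give an RTF between 0.75 and 1.0, which is at or just under real time.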

Quick Start

Installation

pip install onnxruntime-gpu numpy

For CPU-only:

pip install onnxruntime numpy

Usage

import numpy as np
import onnxruntime as ort

# Load model
session = ort.InferenceSession(
    "nano_codec_decoder.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
)

# Prepare input
tokens = np.random.randint(0, 500, (1, 4, 10), dtype=np.int64)  # [batch, codebooks, frames]
tokens_len = np.array([10], dtype=np.int64)

# Run inference
outputs = session.run(
    None,
    {"tokens": tokens, "tokens_len": tokens_len}
)

audio, audio_len = outputs
print(f"Generated audio: {audio.shape}")  # [1, 17640] samples
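
To listen to the result, the float32 output can be converted to 16-bit PCM and written as a mono WAV file. The sketch below uses only NumPy and the standard-library wave module; the helper names float_to_pcm16 and write_wav are illustrative and not part of this repository:

```python
import wave

import numpy as np

SAMPLE_RATE_HZ = 22050  # fixed output rate of the decoder


def float_to_pcm16(audio: np.ndarray) -> np.ndarray:
    """Clip to [-1.0, 1.0] and scale to signed 16-bit integers."""
    clipped = np.clip(audio, -1.0, 1.0)
    return np.round(clipped * 32767.0).astype(np.int16)


def write_wav(path: str, audio: np.ndarray, num_samples: int,
              sample_rate: int = SAMPLE_RATE_HZ) -> None:
    """Write the first num_samples of a 1-D float waveform as mono 16-bit WAV."""
    pcm = float_to_pcm16(audio[:num_samples])
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono
        wav_file.setsampwidth(2)       # 2 bytes per sample (int16)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.astype("<i2").tobytes())  # little-endian PCM
```

With the variables from the example above, write_wav("out.wav", audio[0], int(audio_len[0])) would save the first item in the batch.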

Integration with KaniTTS

from onnx_decoder_optimized import ONNXKaniTTSDecoderOptimized

# Initialize decoder
decoder = ONNXKaniTTSDecoderOptimized(
    onnx_model_path="nano_codec_decoder.onnx",
    device="cuda"
)

# Decode frame (4 codec tokens)
codes = [100, 200, 300, 400]
audio = decoder.decode_frame(codes)  # Returns int16 numpy array
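
When several frames are available at once, they can be packed into the [batch, 4, num_frames] layout the ONNX model expects and decoded in a single call. The sketch below assumes the TTS model emits a flat, frame-major stream (the 4 codes of frame 0, then the 4 codes of frame 1, and so on); that ordering is an assumption, and pack_frames is a hypothetical helper rather than part of KaniTTS:

```python
import numpy as np

NUM_CODEBOOKS = 4  # codec tokens per frame


def pack_frames(flat_codes) -> np.ndarray:
    """Reshape a flat, frame-major code stream into [1, 4, num_frames] int64."""
    codes = np.asarray(flat_codes, dtype=np.int64)
    if codes.ndim != 1 or codes.size % NUM_CODEBOOKS != 0:
        raise ValueError("expected a flat sequence whose length is a multiple of 4")
    per_frame = codes.reshape(-1, NUM_CODEBOOKS)   # [num_frames, 4]
    per_codebook = per_frame.T                     # [4, num_frames]
    # Add the batch axis and make the array contiguous before inference.
    return np.ascontiguousarray(per_codebook[np.newaxis])


tokens = pack_frames([100, 200, 300, 400, 101, 201, 301, 401])
print(tokens.shape)  # (1, 4, 2)
```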

Model Architecture

The decoder consists of two stages:

Dequantization (FSQ): Converts token indices to a latent representation
- Input: [batch, 4, frames] → Output: [batch, 16, frames]
Audio Decoder (HiFiGAN): Generates audio from the latents
- Input: [batch, 16, frames] → Output: [batch, samples]
- Upsampling factor: 1764x (80 ms per frame at 22050 Hz)
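
The tensor shapes at each stage can be worked out from the frame count alone. A small sketch of that bookkeeping, using only the figures stated in this document; stage_shapes is an illustrative name, and the authoritative output length is whatever the model returns in audio_len:

```python
SAMPLE_RATE_HZ = 22050
FRAME_RATE_HZ = 12.5      # from the source model name
NUM_CODEBOOKS = 4
LATENT_CHANNELS = 16

# 22050 / 12.5 = 1764 samples generated per input frame
SAMPLES_PER_FRAME = round(SAMPLE_RATE_HZ / FRAME_RATE_HZ)


def stage_shapes(batch: int, num_frames: int) -> dict:
    """Expected shapes of the tokens, the FSQ latents, and the audio output."""
    return {
        "tokens": (batch, NUM_CODEBOOKS, num_frames),
        "latents": (batch, LATENT_CHANNELS, num_frames),
        "audio": (batch, num_frames * SAMPLES_PER_FRAME),
    }


print(stage_shapes(1, 10)["audio"])  # (1, 17640)
```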

Export Details

Source Model: nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps
Export Method: PyTorch → ONNX (legacy exporter)
Opset Version: 14
Dynamic Axes: Frame dimension and audio samples
Optimizations: Graph optimization enabled, constant folding
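
The export script itself is not included here. As a rough illustration only, the settings listed above map onto keyword arguments of torch.onnx.export along these lines; the axis labels and the wrapper module mentioned in the comments are assumptions:

```python
# Keyword arguments for torch.onnx.export that mirror the export details above.
EXPORT_KWARGS = {
    "opset_version": 14,
    "do_constant_folding": True,
    "input_names": ["tokens", "tokens_len"],
    "output_names": ["audio", "audio_len"],
    # Only the frame and sample dimensions are marked dynamic;
    # the axis labels are illustrative.
    "dynamic_axes": {
        "tokens": {2: "num_frames"},
        "audio": {1: "num_samples"},
    },
}

# Hypothetical call, assuming `decoder` is a torch.nn.Module wrapping the
# FSQ dequantizer and the HiFiGAN decoder, and `example_inputs` is a
# (tokens, tokens_len) tuple of int64 tensors:
#
#   torch.onnx.export(decoder, example_inputs,
#                     "nano_codec_decoder.onnx", **EXPORT_KWARGS)
```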

Use Cases

Text-to-Speech Systems: Fast neural codec decoding
Real-time Audio Generation: Faster-than-real-time decoding on GPU (RTF below 1)
Streaming TTS: Low-latency frame-by-frame decoding
KaniTTS Integration: Drop-in replacement for PyTorch decoder

Requirements

GPU (Recommended)

CUDA 11.8+ or 12.x
cuDNN 8.x or 9.x
ONNX Runtime GPU: pip install onnxruntime-gpu

CPU

Any modern CPU
ONNX Runtime: pip install onnxruntime
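
To run the same script on both GPU and CPU machines, the provider list can be filtered against what the installed ONNX Runtime build reports. A minimal sketch; choose_providers is an illustrative helper, not an ONNX Runtime function:

```python
PREFERRED_PROVIDERS = ["CUDAExecutionProvider", "CPUExecutionProvider"]


def choose_providers(available, preferred=PREFERRED_PROVIDERS):
    """Keep the preferred providers that are actually available, in order."""
    chosen = [name for name in preferred if name in available]
    if not chosen:
        raise RuntimeError(f"none of {preferred} are available: {list(available)}")
    return chosen


# On a CPU-only install, only the CPU provider survives the filter.
print(choose_providers(["CPUExecutionProvider"]))  # ['CPUExecutionProvider']
```

With ONNX Runtime installed, the available list comes from ort.get_available_providers(), and the result can be passed as the providers argument of ort.InferenceSession.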

Inputs

tokens (int64): Codec token indices
- Shape: [batch_size, 4, num_frames]
- Range: [0, 499] (FSQ codebook indices)
tokens_len (int64): Number of frames
- Shape: [batch_size]
- Value: Number of frames in the sequence
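
Checking the inputs before calling session.run turns shape or range mistakes into readable errors. A sketch of such a check, based on the shapes and token range documented above; validate_inputs is an illustrative helper, and the rule that tokens_len may not exceed num_frames is an assumption:

```python
import numpy as np

NUM_CODEBOOKS = 4
MAX_TOKEN_INDEX = 499  # upper bound of the documented token range


def validate_inputs(tokens: np.ndarray, tokens_len: np.ndarray) -> None:
    """Raise if the arrays do not match the documented input contract."""
    if tokens.dtype != np.int64 or tokens_len.dtype != np.int64:
        raise TypeError("tokens and tokens_len must both be int64")
    if tokens.ndim != 3 or tokens.shape[1] != NUM_CODEBOOKS:
        raise ValueError(f"tokens must be [batch, 4, num_frames], got {tokens.shape}")
    if tokens_len.shape != (tokens.shape[0],):
        raise ValueError(f"tokens_len must be [batch], got {tokens_len.shape}")
    if tokens.size and (tokens.min() < 0 or tokens.max() > MAX_TOKEN_INDEX):
        raise ValueError("token indices must lie in [0, 499]")
    # Assumption: a length cannot exceed the padded frame dimension.
    if tokens_len.size and (tokens_len.min() < 0 or tokens_len.max() > tokens.shape[2]):
        raise ValueError("tokens_len must lie in [0, num_frames]")
```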

Outputs

audio (float32): Generated audio waveform
- Shape: [batch_size, num_samples]
- Range: [-1.0, 1.0]
- Sample rate: 22050 Hz
audio_len (int64): Audio length
- Shape: [batch_size]
- Value: Number of audio samples

Accuracy

Compared to the PyTorch reference implementation:

Mean Absolute Error: 0.0087
Correlation: 1.000000 (to six decimal places)
Relative Error: 0.0006%

Audio quality is virtually identical to the PyTorch version.
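
Metrics of this kind can be reproduced by decoding the same tokens with both backends and comparing the waveforms. The sketch below computes mean absolute error and Pearson correlation with NumPy; it is not the script used to produce the numbers above, and it omits relative error because its exact definition is not given here:

```python
import numpy as np


def compare_waveforms(reference, candidate) -> dict:
    """Mean absolute error and Pearson correlation between two waveforms."""
    ref = np.asarray(reference, dtype=np.float64).ravel()
    cand = np.asarray(candidate, dtype=np.float64).ravel()
    if ref.shape != cand.shape:
        raise ValueError(f"length mismatch: {ref.shape} vs {cand.shape}")
    mae = float(np.mean(np.abs(ref - cand)))
    correlation = float(np.corrcoef(ref, cand)[0, 1])
    return {"mae": mae, "correlation": correlation}
```

Here reference would be the PyTorch output and candidate the ONNX output for identical token input, both trimmed to the length reported in audio_len.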

Limitations

Fixed sample rate (22050 Hz)
Single-channel (mono) audio only
Requires valid FSQ token indices (0-499 range)
Best performance on NVIDIA GPUs with CUDA support

License

Apache 2.0 (same as source model)

Acknowledgments

NVIDIA NeMo team for the original NanoCodec
ONNX Runtime team for the inference engine
KaniTTS team for the TTS system

Prasanna05/nano-codec-decoder-onnx