XTTS v2 β€” Kinyarwanda

A Coqui XTTS v2 text-to-speech model fine-tuned for Kinyarwanda (rw), trained on speech from the Mozilla Common Voice corpus.

Usage

Requirements

This model needs a patched build of the upstream Coqui TTS package. Clone the fine-tuning repository and install its dependencies:

git clone https://github.com/Alexgichamba/XTTSv2-Finetuning-for-New-Languages.git
cd XTTSv2-Finetuning-for-New-Languages
pip install -r requirements.txt

Quick Start

import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load model
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_path="model.pth", vocab_path="vocab.json", use_deepspeed=False)
model.to("cuda" if torch.cuda.is_available() else "cpu")

# Compute conditioning latents and the speaker embedding from a reference clip
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference_speaker.wav",
    gpt_cond_len=model.config.gpt_cond_len,
    max_ref_length=model.config.max_ref_len,
    sound_norm_refs=model.config.sound_norm_refs,
)

# Synthesize
result = model.inference(
    text="Ndashaka amazi n'ibiryo",
    language="rw",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.1,
    length_penalty=1.0,
    repetition_penalty=10.0,
    top_k=10,
    top_p=0.3,
)

torchaudio.save("output.wav", torch.tensor(result["wav"]).unsqueeze(0), 24000)
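XTTS-style models are typically run on short, sentence-length inputs; long passages are usually split into chunks, synthesized one call at a time, and the waveforms concatenated. The card does not state this model's exact input limit, so the sketch below is a minimal, hypothetical chunking helper (`chunk_text` is not part of the model's API; the 200-character default is an assumption):

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text on sentence boundaries, packing sentences into
    chunks no longer than max_chars (a single over-long sentence
    is emitted as its own chunk)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk would then be passed to model.inference(...) and the
# resulting result["wav"] arrays concatenated before saving.
```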

CLI Inference

A full inference script is included:

python inference.py \
  -t "Ndashaka amazi n'ibiryo" \
  -s reference_speaker.wav \
  -l rw \
  -o output.wav
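To synthesize many lines with the same speaker, the flags above can be assembled programmatically. The sketch below only builds the argument lists (`build_inference_cmd` is a hypothetical helper; it assumes the `-t`/`-s`/`-l`/`-o` flags shown in the CLI example and does not execute the script):

```python
def build_inference_cmd(text, speaker_wav, out_wav, language="rw"):
    """Assemble the argv list for inference.py using the flags
    from the CLI example above."""
    return [
        "python", "inference.py",
        "-t", text,
        "-s", speaker_wav,
        "-l", language,
        "-o", out_wav,
    ]

lines = ["Muraho neza.", "Murakoze cyane."]
cmds = [
    build_inference_cmd(line, "reference_speaker.wav", f"out_{i}.wav")
    for i, line in enumerate(lines)
]
# Each command could then be run with subprocess.run(cmd, check=True).
```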

Files

  • model.pth β€” Model weights (85k-step checkpoint)
  • config.json β€” Model configuration
  • vocab.json β€” Tokenizer vocabulary
  • inference.py β€” Standalone inference script
  • reference_speaker.wav β€” Sample reference audio for voice cloning
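The Quick Start snippet assumes all five files sit in the working directory, so a quick presence check before loading can save a confusing traceback. A minimal stdlib sketch (`missing_files` is a hypothetical helper, not part of this repository):

```python
from pathlib import Path

REQUIRED_FILES = [
    "model.pth",
    "config.json",
    "vocab.json",
    "inference.py",
    "reference_speaker.wav",
]

def missing_files(model_dir: str) -> list[str]:
    """Return the required files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).exists()]

# Example: fail fast before loading the checkpoint.
# missing = missing_files(".")
# if missing:
#     raise FileNotFoundError(f"Missing model files: {missing}")
```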