# XTTS v2 Kinyarwanda

A Coqui XTTS v2 text-to-speech model fine-tuned for Kinyarwanda (`rw`), trained on speech from the Mozilla Common Voice dataset.
## Usage

### Requirements

The upstream `TTS` package requires a patched installation. Clone the fine-tuning repo and install its dependencies:
```bash
git clone https://github.com/Alexgichamba/XTTSv2-Finetuning-for-New-Languages.git
cd XTTSv2-Finetuning-for-New-Languages
pip install -r requirements.txt
```
### Quick Start
```python
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the fine-tuned model
config = XttsConfig()
config.load_json("config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(
    config,
    checkpoint_path="model.pth",
    vocab_path="vocab.json",
    use_deepspeed=False,
)
model.to("cuda" if torch.cuda.is_available() else "cpu")

# Compute conditioning latents from a reference audio clip (voice cloning)
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path="reference_speaker.wav",
    gpt_cond_len=model.config.gpt_cond_len,
    max_ref_length=model.config.max_ref_len,
    sound_norm_refs=model.config.sound_norm_refs,
)

# Synthesize
result = model.inference(
    text="Ndashaka amazi n'ibiryo",
    language="rw",
    gpt_cond_latent=gpt_cond_latent,
    speaker_embedding=speaker_embedding,
    temperature=0.1,
    length_penalty=1.0,
    repetition_penalty=10.0,
    top_k=10,
    top_p=0.3,
)

# XTTS produces 24 kHz audio
torchaudio.save("output.wav", torch.tensor(result["wav"]).unsqueeze(0), 24000)
```
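XTTS works best on sentence-length inputs, so a common pattern for longer passages is to split the text into sentence-sized chunks, synthesize each one with the same conditioning latents, and concatenate the waveforms. A minimal sketch of the splitting step (the `split_sentences` helper, its regex, and the `max_chars` limit are illustrative assumptions, not part of this model):

```python
import re

def split_sentences(text, max_chars=200):
    # Split on sentence-ending punctuation, then pack sentences
    # into chunks no longer than max_chars characters.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for part in parts:
        if current and len(current) + len(part) + 1 > max_chars:
            chunks.append(current)
            current = part
        else:
            current = f"{current} {part}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be passed to model.inference(...) as in the
# Quick Start above, and the resulting "wav" arrays concatenated
# (e.g. with torch.cat) before saving at 24 kHz.
```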
### CLI Inference
A full inference script is included:
```bash
python inference.py \
    -t "Ndashaka amazi n'ibiryo" \
    -s reference_speaker.wav \
    -l rw \
    -o output.wav
```
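To synthesize many utterances, the CLI can be driven from a small wrapper script. A sketch that builds one command per input line, reusing the `-t`/`-s`/`-l`/`-o` flags shown above (the `build_commands` helper and output naming scheme are illustrative assumptions):

```python
import shlex

def build_commands(lines, speaker="reference_speaker.wav", lang="rw"):
    # Build one inference.py invocation per text line,
    # numbering the output files output_000.wav, output_001.wav, ...
    cmds = []
    for i, text in enumerate(lines):
        cmds.append([
            "python", "inference.py",
            "-t", text,
            "-s", speaker,
            "-l", lang,
            "-o", f"output_{i:03d}.wav",
        ])
    return cmds

for cmd in build_commands(["Muraho", "Ndashaka amazi n'ibiryo"]):
    print(shlex.join(cmd))
```

Each command list can be run with `subprocess.run(cmd, check=True)`; printing via `shlex.join` keeps quoting correct if you paste the commands into a shell instead.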
## Files

- `model.pth` – Model weights (85k-step checkpoint)
- `config.json` – Model configuration
- `vocab.json` – Tokenizer vocabulary
- `inference.py` – Standalone inference script
- `reference_speaker.wav` – Sample reference audio for voice cloning