PocketTTS CoreML

CoreML conversion of kyutai/pocket-tts for on-device inference on Apple platforms.

Models

Model         Description                                         Size
cond_step     KV cache prefill (voice + text conditioning)        ~200 MB
flowlm_step   Autoregressive generation (transformer_out + EOS)   ~200 MB
flow_decoder  Flow matching denoiser (8 Euler steps per frame)    ~190 MB
mimi_decoder  Streaming audio codec (1920 samples per frame)      ~11 MB
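
The per-frame figures in the table can be illustrated with a minimal sketch. It is not the FluidAudio implementation: eulerIntegrate, frameDuration, and toyVelocity are hypothetical names, the velocity closure is a toy stand-in for the flow_decoder model, and the 24 kHz sample rate is an assumption that is not stated in this README.

```swift
// Fixed-step Euler integration from t = 0 to t = 1, the general scheme behind
// "8 Euler steps per frame". The velocity closure is a hypothetical stand-in
// for a call into the flow_decoder model.
func eulerIntegrate(
    from start: [Float],
    steps: Int,
    velocity: ([Float], Float) -> [Float]
) -> [Float] {
    precondition(steps > 0, "steps must be positive")
    var x = start
    let dt = 1.0 / Float(steps)
    for k in 0..<steps {
        let t = Float(k) * dt          // time at the start of this step
        let v = velocity(x, t)         // predicted velocity at (x, t)
        for i in x.indices {
            x[i] += dt * v[i]          // Euler update: x <- x + dt * v
        }
    }
    return x
}

// Duration in seconds of one codec frame at a given sample rate.
func frameDuration(samplesPerFrame: Int, sampleRate: Double) -> Double {
    Double(samplesPerFrame) / sampleRate
}

// Toy velocity field that points straight at a fixed target, so the
// integration should land on the target at t = 1.
let target: [Float] = [0.5, -0.25, 1.0]
let toyVelocity: ([Float], Float) -> [Float] = { x, t in
    zip(x, target).map { pair in (pair.1 - pair.0) / (1 - t) }
}

let latent = eulerIntegrate(from: [0, 0, 0], steps: 8, velocity: toyVelocity)

// Assumption: 24 kHz output audio (not stated in this README).
let seconds = frameDuration(samplesPerFrame: 1920, sampleRate: 24_000)
print(latent, seconds)
```

Under the assumed 24 kHz rate, each 1920-sample frame covers 80 ms of audio, so one second of speech would need roughly 12 to 13 passes through flowlm_step, flow_decoder, and mimi_decoder.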

Voices

Four pre-encoded voices are included in constants_bin/:

  • alba (default), azelma, cosette, javert

Voice cloning weights are not included; Kyutai gates them separately.

Usage

import FluidAudioTTS

let manager = PocketTtsManager()
try await manager.initialize()
let audio = try await manager.synthesize(text: "Hello, world!")

See https://github.com/FluidInference/FluidAudio for the full Swift framework.

License

CC-BY-4.0, inherited from https://huggingface.co/kyutai/pocket-tts. Attribution to Kyutai is required.

References

- https://huggingface.co/kyutai/pocket-tts
- https://arxiv.org/abs/2410.00037
- https://github.com/FluidInference/FluidAudio