CoreML conversion of kyutai/pocket-tts for on-device inference on Apple platforms.
| Model | Description | Size |
|---|---|---|
| cond_step | KV cache prefill (voice + text conditioning) | ~200MB |
| flowlm_step | Autoregressive generation (transformer_out + EOS) | ~200MB |
| flow_decoder | Flow matching denoiser (8 Euler steps per frame) | ~190MB |
| mimi_decoder | Streaming audio codec (1920 samples per frame) | ~11MB |
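The table's `flow_decoder` runs 8 fixed Euler steps per frame. As a minimal sketch of that integration loop, the snippet below replaces the CoreML model with a toy linear velocity field (the field and dimensions are illustrative only; the real velocity comes from the converted network):

```swift
import Foundation

// Denoise one frame's latent by integrating a velocity field with
// a fixed number of Euler steps (8, matching the model card).
func eulerDenoise(noisy x0: [Double], steps: Int = 8,
                  velocity: ([Double], Double) -> [Double]) -> [Double] {
    var x = x0
    let dt = 1.0 / Double(steps)
    for k in 0..<steps {
        let t = Double(k) * dt          // time in [0, 1)
        let v = velocity(x, t)          // stand-in for flow_decoder
        for i in x.indices { x[i] += dt * v[i] }
    }
    return x
}

// Toy example: a velocity field pointing at a fixed target latent.
let target = [1.0, -0.5, 0.25]
let denoised = eulerDenoise(noisy: [0.0, 0.0, 0.0]) { x, _ in
    zip(target, x).map { $0 - $1 }
}
print(denoised)
```

Each frame's denoised latent is then handed to `mimi_decoder`, which emits 1920 audio samples per frame.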
4 pre-encoded voices ship in `constants_bin/`: alba (default), azelma, cosette, and javert. Voice cloning weights are not included; they are gated separately by Kyutai.
```swift
import FluidAudioTTS

let manager = PocketTtsManager()
try await manager.initialize()
let audio = try await manager.synthesize(text: "Hello, world!")
```
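To inspect the output outside the framework, the synthesized samples can be written to a WAV file. The sketch below assumes mono `Float` samples at 24 kHz (consistent with `mimi_decoder`'s 1920 samples per 80 ms frame, but check the FluidAudio repo for the actual return type); it substitutes a generated test tone for the framework's output so it is self-contained:

```swift
import Foundation

// Sketch: persist mono Float samples as a 16-bit PCM WAV file.
// The 24 kHz rate is an assumption inferred from the frame size.
func writeWav(_ samples: [Float], sampleRate: UInt32, to url: URL) throws {
    let pcm = samples.map { Int16(max(-1, min(1, $0)) * 32767) }
    var data = Data()
    func append<T>(_ value: T) {
        withUnsafeBytes(of: value) { data.append(contentsOf: $0) }
    }
    let dataSize = UInt32(pcm.count * 2)
    data.append("RIFF".data(using: .ascii)!); append(UInt32(36 + dataSize))
    data.append("WAVE".data(using: .ascii)!)
    data.append("fmt ".data(using: .ascii)!); append(UInt32(16))
    append(UInt16(1)); append(UInt16(1))        // PCM format, mono
    append(sampleRate); append(sampleRate * 2)  // sample rate, byte rate
    append(UInt16(2)); append(UInt16(16))       // block align, bits/sample
    data.append("data".data(using: .ascii)!); append(dataSize)
    pcm.forEach { append($0) }
    try data.write(to: url)
}

// Stand-in for synthesized audio: one 80 ms frame (1920 samples)
// of a 440 Hz tone at the assumed 24 kHz rate.
let tone = (0..<1920).map { Float(sin(2 * .pi * 440 * Double($0) / 24_000)) }
do {
    try writeWav(tone, sampleRate: 24_000, to: URL(fileURLWithPath: "tone.wav"))
    print("wrote tone.wav")
} catch {
    print("write failed: \(error)")
}
```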
See https://github.com/FluidInference/FluidAudio for the full Swift framework.
## License
CC-BY-4.0, inherited from https://huggingface.co/kyutai/pocket-tts. Attribution to Kyutai is required.
## References
- https://huggingface.co/kyutai/pocket-tts
- https://arxiv.org/abs/2410.00037
- https://github.com/FluidInference/FluidAudio