PocketTTS CoreML

CoreML conversion of kyutai/pocket-tts for on-device inference on Apple platforms.

Models

Model	Description	Size
cond_step	KV cache prefill (voice + text conditioning)	~200MB
flowlm_step	Autoregressive generation (transformer_out + EOS)	~200MB
flow_decoder	Flow matching denoiser (8 Euler steps per frame)	~190MB
mimi_decoder	Streaming audio codec (1920 samples per frame)	~11MB
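
The "8 Euler steps per frame" and "1920 samples per frame" figures in the table can be illustrated with a small, model-free sketch. It is not the FluidAudio implementation: `eulerIntegrate`, `sampleCount`, and the straight-line velocity field are hypothetical stand-ins for flow_decoder and mimi_decoder, used only to show fixed-step Euler integration from t = 0 to t = 1 and the frame-to-sample arithmetic.

```swift
// Toy sketch of fixed-step Euler integration, the kind of sampler a
// flow-matching decoder uses. `velocity` stands in for the flow_decoder
// network; everything here is hypothetical and runs without any model.

/// Integrates dx/dt = velocity(x, t) from t = 0 to t = 1 in `steps` equal Euler steps.
func eulerIntegrate(
    from start: [Float],
    steps: Int,
    velocity: ([Float], Float) -> [Float]
) -> [Float] {
    precondition(steps > 0, "steps must be positive")
    var x = start
    let dt = 1.0 / Float(steps)
    for i in 0..<steps {
        let t = Float(i) * dt          // time at the start of this step
        let v = velocity(x, t)         // one "network" evaluation per step
        for j in x.indices {
            x[j] += dt * v[j]          // Euler update: x <- x + dt * v
        }
    }
    return x
}

/// Number of audio samples produced by `frames` codec frames.
func sampleCount(frames: Int, samplesPerFrame: Int = 1920) -> Int {
    frames * samplesPerFrame
}

// Hypothetical velocity field: a constant straight-line path from the noise
// sample to a fixed target, v = target - noise.
let noise: [Float] = [0.5, -1.0, 2.0]
let target: [Float] = [1.0, 0.0, -1.0]
let latent = eulerIntegrate(from: noise, steps: 8) { _, _ in
    zip(target, noise).map { pair in pair.0 - pair.1 }
}

print(latent)                      // ends at the target for this toy field
print(sampleCount(frames: 25))     // 25 frames * 1920 samples per frame
```

Assuming the codec runs at 24 kHz, as Mimi does in the Moshi paper, 1920 samples correspond to 80 ms of audio per frame, so the 8 Euler steps would need to finish well within that window for real-time synthesis.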

Voices

Four pre-encoded voices are included in constants_bin/:

alba (default), azelma, cosette, javert

Voice cloning weights are not included; Kyutai gates them separately.

Usage

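import FluidAudioTTS

// Call from an async context; both initialize() and synthesize(text:) can throw.
let manager = PocketTtsManager()
try await manager.initialize()
// Synthesizes with the default voice (alba).
let audio = try await manager.synthesize(text: "Hello, world!")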

See https://github.com/FluidInference/FluidAudio for the full Swift framework.

License

CC-BY-4.0, inherited from https://huggingface.co/kyutai/pocket-tts. Attribution to Kyutai is required.

References

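- kyutai/pocket-tts (original model): https://huggingface.co/kyutai/pocket-tts
- Moshi: a speech-text foundation model for real-time dialogue: https://arxiv.org/abs/2410.00037
- FluidAudio (Swift framework): https://github.com/FluidInference/FluidAudio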
