license: apache-2.0
base_model: moonshotai/Kimi-K2-Instruct-0905
tags:
  - mlx
  - quantized
  - kimi
  - deepseek-v3
  - moe
  - instruction-following
  - 8-bit
  - apple-silicon
model_type: kimi_k2
pipeline_tag: text-generation
language:
  - en
  - zh
library_name: mlx

🌙 Kimi K2 Instruct - MLX 8-bit

State-of-the-Art 1T-Parameter MoE Model, Optimized for Apple Silicon

Original Model | MLX Framework | More Quantizations

📖 What is This?

This is an 8-bit quantized version of Kimi K2 Instruct (0905), converted to run on Apple Silicon Macs with the MLX framework. The base model is a Mixture-of-Experts network with roughly 1 trillion total parameters; at 8 bits the weights take about 1 TB on disk while staying close to the quality of the original checkpoint.

✨ Why You'll Love It

🚀 Massive Context Window - Handle up to 262,144 tokens (~200,000 words!)
🧠 ~1T Parameters (about 32B active per token) - One of the most capable open models available
⚡ Apple Silicon Native - Fully optimized for M-series chips with Metal acceleration
🎯 8-bit Precision - Best quality-to-size ratio for serious work
🌏 Bilingual - Fluent in both English and Chinese
💬 Instruction-Tuned - Ready for conversations, coding, analysis, and more

🎯 Quick Start

Hardware Requirements

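Kimi K2 is a massive MoE model with roughly 1 trillion total parameters. Choose your quantization based on available unified memory: the weights must fit in memory, with headroom left for the KV cache.

Quantization	Approx. Model Size	Min RAM	Quality
2-bit	~320 GB	> 320 GB	Acceptable - some quality loss
3-bit	~450 GB (estimate)	> 450 GB	Good - recommended minimum
4-bit	~570 GB	> 570 GB	Very Good - best quality/size balance
5-bit	~700 GB (estimate)	> 700 GB	Excellent
6-bit	~800 GB	> 800 GB	Near original
8-bit	~1 TB	> 1 TB	Closest to original

The 2-, 4-, 6-, and 8-bit sizes match the "Other Quantization Options" table in this README; the 3- and 5-bit figures are estimates scaled from them.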

Recommended Configurations

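Mac Model	Max RAM	Recommended Quantization
Mac Studio M2 Ultra	192 GB	None fit in memory
Mac Studio M3 Ultra	512 GB	2-bit (~320 GB)
Mac Pro M2 Ultra	192 GB	None fit in memory
MacBook Pro M3 Max	128 GB	None fit in memory
MacBook Pro M4 Max	128 GB	None fit in memory

The 8-bit weights in this repository (~1 TB) exceed the memory of any single machine listed here.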

Performance Notes

Inference Speed: Expect ~5-15 tokens/sec depending on quantization and hardware
Model Load Time: 10-30 seconds or more before the first token, depending on model size and SSD speed
Context Window: Up to 262,144 tokens (256K); long contexts need extra memory for the KV cache
Active Parameters: Only ~32B parameters active per token (MoE architecture)

Installation

pip install mlx-lm

Your First Generation (3 lines of code!)

from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")
print(generate(model, tokenizer, prompt="Explain quantum entanglement simply:", max_tokens=200))

That's it! 🎉

💻 System Requirements

Component	Minimum	Recommended
Mac	M1 or newer	M2 Ultra / M3 Max / M4 Max+
Memory	More than the model size (~1 TB for 8-bit)	Model size plus headroom for the KV cache
Storage	1.1 TB free	Fast SSD (2+ TB)
macOS	13.5+ (newer MLX releases may require a later version)	Latest version

⚠️ Note: This is a HUGE model! Make sure you have enough RAM and storage.
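Before starting a download of this size, it can help to check free disk space and installed memory programmatically. The sketch below uses only the Python standard library; `preflight` and its thresholds are hypothetical names and placeholders, and the memory lookup assumes the POSIX `sysconf` keys `SC_PAGE_SIZE` and `SC_PHYS_PAGES` are available on your system.

```python
import os
import shutil

GB = 1000 ** 3  # decimal gigabytes, matching how disk sizes are usually quoted


def free_disk_gb(path: str = ".") -> float:
    """Free space on the volume containing `path`, in GB."""
    return shutil.disk_usage(path).free / GB


def physical_memory_gb() -> float:
    """Total physical memory in GB, read through POSIX sysconf."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GB


def preflight(required_disk_gb: float, required_memory_gb: float, path: str = ".") -> list[str]:
    """Return human-readable problems; an empty list means both checks passed."""
    problems = []
    disk = free_disk_gb(path)
    if disk < required_disk_gb:
        problems.append(f"only {disk:.0f} GB free on disk, need {required_disk_gb:.0f} GB")
    memory = physical_memory_gb()
    if memory < required_memory_gb:
        problems.append(f"only {memory:.0f} GB of memory, need {required_memory_gb:.0f} GB")
    return problems


if __name__ == "__main__":
    # 1100 GB mirrors the "1.1 TB free" figure above; 1000 GB assumes the
    # ~1 TB of 8-bit weights has to fit in memory. Adjust both for your setup.
    for problem in preflight(required_disk_gb=1100, required_memory_gb=1000):
        print("WARNING:", problem)
```

Running the script prints one warning per failed check and nothing if both pass.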

📚 Usage Examples

Command Line Interface

mlx_lm.generate \
  --model richardyoung/Kimi-K2-Instruct-0905-MLX-8bit \
  --prompt "Write a Python script to analyze CSV files." \
  --max-tokens 500

Chat Conversation

from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

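# Let the tokenizer's chat template insert the model's own special tokens
# instead of hand-writing ChatML-style markers, which Kimi K2 does not use.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant specialized in coding and problem-solving."},
    {"role": "user", "content": "Can you help me optimize this Python code?"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=500)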
print(response)
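For multi-turn conversations, one option is to keep the history as role/content messages and let the tokenizer's chat template render them. This is a minimal sketch: `build_messages` and `chat_once` are hypothetical helper names, not part of mlx-lm, and `chat_once` assumes a `model` and `tokenizer` already returned by `load`.

```python
def build_messages(system_prompt, history, user_message):
    """Assemble a role/content message list for tokenizer.apply_chat_template.

    `history` is a list of (user_text, assistant_text) pairs from earlier turns.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for user_turn, assistant_turn in history:
        messages.append({"role": "user", "content": user_turn})
        messages.append({"role": "assistant", "content": assistant_turn})
    messages.append({"role": "user", "content": user_message})
    return messages


def chat_once(model, tokenizer, messages, max_tokens=500):
    """Render messages with the model's chat template and generate one reply."""
    from mlx_lm import generate  # imported lazily so build_messages works without MLX

    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
```

After each reply, append the `(user_message, reply)` pair to `history` so the next call sees the full conversation.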

Advanced: Streaming Output

from mlx_lm import load, stream_generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

# stream_generate yields response objects; the newly generated text is in .text
for response in stream_generate(
    model,
    tokenizer,
    prompt="Tell me about the future of AI:",
    max_tokens=500,
):
    print(response.text, end="", flush=True)
print()

🏗️ Architecture Highlights


Model Specifications

Feature	Value
Total Parameters	~1 Trillion (about 32B active per token)
Architecture	DeepSeek V3-style (MoE)
Experts	384 routed + 1 shared
Active Experts	8 per token
Hidden Size	7168
Layers	61
Heads	64
Context Length	262,144 tokens
Quantization	8.501 bits per weight

Advanced Features

🎯 YaRN RoPE Scaling - 64x factor for extended context
🗜️ KV Compression - Low-rank latent projection (kv_lora_rank 512), as in Multi-head Latent Attention
⚡ Query Compression - Low-rank query projection (q_lora_rank 1536)
🧮 MoE Routing - Top-8 expert selection with sigmoid scoring
🔧 FP8 Source Weights - The original checkpoint is distributed in FP8 (e4m3)
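To make the routing bullet concrete, here is a simplified sketch of top-k expert selection with sigmoid scoring. It is an illustration, not the model's actual implementation: it leaves out the shared expert and any bias terms, expert grouping, or scaling factors the real router may apply, and `route_top_k` is a hypothetical name.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def route_top_k(router_logits: list[float], k: int = 8) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and normalize their gate weights.

    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    scores = [sigmoid(z) for z in router_logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(384)]  # one logit per routed expert
    for expert, weight in route_top_k(logits, k=8):
        print(f"expert {expert:3d}  weight {weight:.3f}")
```

Unlike softmax, each sigmoid score is computed independently of the other experts, so the selected scores are renormalized afterwards to form the mixing weights.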

🎨 Other Quantization Options

Choose the right balance for your needs:

Quantization	Size	Quality	Speed	Best For
8-bit (you are here)	~1 TB	⭐⭐⭐⭐⭐	⭐⭐⭐	Production, best quality
6-bit	~800 GB	⭐⭐⭐⭐	⭐⭐⭐⭐	Sweet spot for most users
4-bit	~570 GB	⭐⭐⭐	⭐⭐⭐⭐⭐	Faster inference
2-bit	~320 GB	⭐⭐	⭐⭐⭐⭐⭐	Experimental
Original (FP8)	~1 TB	⭐⭐⭐⭐⭐	⭐⭐	Research only
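Using the approximate sizes from the table above, a small helper can pick the largest quantization that fits a given memory budget. This is a sketch: `largest_fit` is a hypothetical name, and the 0.8 usable fraction is an assumed allowance for the OS and KV cache, not a measured value.

```python
# Approximate sizes in GB, copied from the quantization table above.
APPROX_SIZE_GB = {"2-bit": 320, "4-bit": 570, "6-bit": 800, "8-bit": 1000}


def largest_fit(memory_gb, sizes=APPROX_SIZE_GB, usable_fraction=0.8):
    """Return the largest quantization whose weights fit in the usable memory.

    Returns None when nothing fits.
    """
    budget = memory_gb * usable_fraction
    fitting = {name: size for name, size in sizes.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for memory in (192, 512, 2000):
        print(f"{memory} GB -> {largest_fit(memory)}")
```

A `None` result means no listed quantization fits in that much memory on a single machine.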

🔧 How It Was Made

This model was quantized using MLX's built-in quantization:

mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path Kimi-K2-Instruct-0905-MLX-8bit \
  -q --q-bits 8 \
  --trust-remote-code

Result: 8.501 bits per weight (includes the per-group scales and biases stored alongside the quantized weights)
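The fractional part of that figure can be reproduced with simple arithmetic. The sketch below assumes group-wise quantization with a group size of 64 and one 16-bit scale plus one 16-bit bias per group; those defaults are assumptions about the conversion settings, and the helper names are hypothetical.

```python
def effective_bits(bits, group_size=64, scale_bits=16, bias_bits=16):
    """Bits stored per weight once per-group scale and bias are amortized."""
    return bits + (scale_bits + bias_bits) / group_size


def quantized_size_gb(n_params, bits, group_size=64):
    """Approximate size in decimal GB of n_params quantized weights."""
    return n_params * effective_bits(bits, group_size) / 8 / 1e9


if __name__ == "__main__":
    print(effective_bits(8))            # 8 bits plus 32 / 64 of overhead
    print(quantized_size_gb(1e12, 8))   # size for a hypothetical 1e12 weights
```

Under these assumptions an 8-bit quantization stores 8.5 bits per weight; the small remainder up to 8.501 would presumably come from tensors left unquantized, though that is a guess.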

⚡ Performance Tips

Getting the best performance

Close other applications - Free up as much RAM as possible
Use an external SSD - If your internal drive is full
Monitor memory - Watch Activity Monitor during inference
Limit generation length - If you run out of memory, lower max_tokens or shorten the prompt
Keep your Mac cool - Good airflow helps maintain peak performance

⚠️ Known Limitations

🍎 Apple Silicon Only - Won't work on Intel Macs or NVIDIA GPUs
💾 Huge Storage Needs - Make sure you have 1.1 TB+ free
🐏 RAM Intensive - The weights (~1 TB at 8-bit) must fit in unified memory
🐌 Slower on M1 - Best performance on M2 Ultra or newer
🌐 Bilingual Focus - Optimized for English and Chinese

📄 License

This repository's metadata declares Apache 2.0. The original model is released by Moonshot AI under a Modified MIT License; review its terms before commercial use.

🙏 Acknowledgments

Original Model: Moonshot AI for creating Kimi K2
Framework: Apple's MLX team for the amazing framework
Inspiration: DeepSeek V3 architecture

📚 Citation

If you use this model in your research or product, please cite:

@misc{kimi-k2-2025,
  title={Kimi K2: Open Agentic Intelligence},
  author={Moonshot AI},
  year={2025},
  url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}
}

🔗 Useful Links

📦 Original Model: moonshotai/Kimi-K2-Instruct-0905
🛠️ MLX Framework: GitHub
📖 MLX LM Docs: GitHub
💬 Discussions: Ask questions here!

Quantized with ❤️ by richardyoung

If you find this useful, please ⭐ star the repo and share with others!

Created: October 2025 | Format: MLX 8-bit