|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: moonshotai/Kimi-K2-Instruct-0905 |
|
|
tags: |
|
|
- mlx |
|
|
- quantized |
|
|
- kimi |
|
|
- deepseek-v3 |
|
|
- moe |
|
|
- instruction-following |
|
|
- 8-bit |
|
|
- apple-silicon |
|
|
model_type: kimi_k2 |
|
|
pipeline_tag: text-generation |
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
library_name: mlx |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
# 🌙 Kimi K2 Instruct - MLX 8-bit |
|
|
|
|
|
### State-of-the-Art 671B MoE Model, Optimized for Apple Silicon |
|
|
|
|
|
|
|
|
|
|
**[Original Model](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)** | **[MLX Framework](https://github.com/ml-explore/mlx)** | **[More Quantizations](#-other-quantization-options)** |
|
|
|
|
|
--- |
|
|
|
|
|
</div> |
|
|
|
|
|
## 📖 What is This? |
|
|
|
|
|
This is a **high-quality 8-bit quantized version** of Kimi K2 Instruct, optimized to run on **Apple Silicon** (M1/M2/M3/M4) Macs using the MLX framework. Think of it as taking a massive 671-billion parameter AI model and compressing it down to ~1 TB while keeping almost all of its intelligence intact! |
|
|
|
|
|
### ✨ Why You'll Love It |
|
|
|
|
|
- 🚀 **Massive Context Window** - Handle up to 262,144 tokens (~200,000 words!) |
|
|
- 🧠 **671B Parameters** - One of the most capable open models available |
|
|
- ⚡ **Apple Silicon Native** - Fully optimized for M-series chips with Metal acceleration |
|
|
- 🎯 **8-bit Precision** - Near-original quality for serious work
|
|
- 🌏 **Bilingual** - Fluent in both English and Chinese |
|
|
- 💬 **Instruction-Tuned** - Ready for conversations, coding, analysis, and more |
|
|
|
|
|
## 🎯 Quick Start |
|
|
|
|
|
|
|
|
|
|
## Hardware Requirements |
|
|
|
|
|
Kimi-K2 is a massive 671B parameter MoE model. Choose your quantization based on available unified memory: |
|
|
|
|
|
| Quantization | Model Size | Min RAM | Quality | |
|
|
|:------------:|:----------:|:-------:|:--------| |
|
|
| **2-bit** | ~84 GB | 96 GB | Acceptable - some quality loss | |
|
|
| **3-bit** | ~126 GB | 128 GB | Good - recommended minimum | |
|
|
| **4-bit** | ~168 GB | 192 GB | Very Good - best quality/size balance | |
|
|
| **5-bit** | ~210 GB | 256 GB | Excellent | |
|
|
| **6-bit** | ~252 GB | 288 GB | Near original | |
|
|
| **8-bit** | ~336 GB | 384 GB | Original quality | |
|
|
|
|
|
### Recommended Configurations |
|
|
|
|
|
| Mac Model | Max RAM | Recommended Quantization | |
|
|
|:----------|:-------:|:-------------------------| |
|
|
| Mac Studio M2 Ultra | 192 GB | 4-bit | |
|
|
| Mac Studio M3 Ultra | 512 GB | 8-bit |
|
|
| Mac Pro M2 Ultra | 192 GB | 4-bit | |
|
|
| MacBook Pro M3 Max | 128 GB | 3-bit | |
|
|
| MacBook Pro M4 Max | 128 GB | 3-bit | |
|
|
|
|
|
### Performance Notes |
|
|
|
|
|
- **Inference Speed**: Expect roughly 5-15 tokens/sec depending on quantization and hardware (see the timing sketch after this list)
|
|
- **First Token Latency**: 10-30 seconds for model loading |
|
|
- **Context Window**: Full 262,144-token (256K) context supported
|
|
- **Active Parameters**: Only ~37B parameters active per token (MoE architecture) |
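Throughput varies a lot by machine, so it is worth measuring on your own hardware. A minimal timing sketch, assuming the model from the Quick Start example is already downloaded:

```python
# Rough tokens/sec measurement; results depend heavily on quantization and hardware.
import time
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

start = time.perf_counter()
text = generate(model, tokenizer, prompt="Briefly explain MoE routing:", max_tokens=200)
elapsed = time.perf_counter() - start

# Includes prompt processing time, so treat the number as a rough estimate.
n_tokens = len(tokenizer.encode(text))
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tokens/sec")
```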
|
|
|
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
pip install mlx-lm |
|
|
``` |
|
|
|
|
|
### Your First Generation (3 lines of code!) |
|
|
|
|
|
```python |
|
|
from mlx_lm import load, generate |
|
|
|
|
|
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit") |
|
|
print(generate(model, tokenizer, prompt="Explain quantum entanglement simply:", max_tokens=200)) |
|
|
``` |
|
|
|
|
|
That's it! 🎉 |
|
|
|
|
|
## 💻 System Requirements |
|
|
|
|
|
| Component | Minimum | Recommended | |
|
|
|-----------|---------|-------------| |
|
|
| **Mac** | M1 or newer | M2 Ultra / M3 Max / M4 Max+ | |
|
|
| **Memory** | 64 GB unified | 128 GB+ unified | |
|
|
| **Storage** | 1.1 TB free | Fast SSD (2+ TB) | |
|
|
| **macOS** | 12.0+ | Latest version | |
|
|
|
|
|
> ⚠️ **Note:** This is a HUGE model! Make sure you have enough RAM and storage. |
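If you want to verify before starting the download, here is a small standard-library check (macOS only; the ~1.1 TB figure comes from the table above):

```python
# Pre-flight check for the 8-bit weights on macOS (standard library only).
import shutil
import subprocess

free_disk_gb = shutil.disk_usage("/").free / 1e9
unified_memory_gb = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"])) / 1e9

print(f"Free disk:      {free_disk_gb:.0f} GB (the 8-bit weights need ~1100 GB)")
print(f"Unified memory: {unified_memory_gb:.0f} GB")
```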
|
|
|
|
|
## 📚 Usage Examples |
|
|
|
|
|
### Command Line Interface |
|
|
|
|
|
```bash |
|
|
mlx_lm.generate \ |
|
|
--model richardyoung/Kimi-K2-Instruct-0905-MLX-8bit \ |
|
|
--prompt "Write a Python script to analyze CSV files." \ |
|
|
--max-tokens 500 |
|
|
``` |
|
|
|
|
|
### Chat Conversation |
|
|
|
|
|
```python |
|
|
from mlx_lm import load, generate |
|
|
|
|
|
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit") |
|
|
|
|
|
conversation = """<|im_start|>system |
|
|
You are a helpful AI assistant specialized in coding and problem-solving.<|im_end|> |
|
|
<|im_start|>user |
|
|
Can you help me optimize this Python code?<|im_end|> |
|
|
<|im_start|>assistant |
|
|
""" |
|
|
|
|
|
response = generate(model, tokenizer, prompt=conversation, max_tokens=500) |
|
|
print(response) |
|
|
``` |
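To keep the conversation going, append each reply and the next user turn to `messages`, then rebuild the prompt. A minimal continuation of the example above:

```python
# Carry the conversation forward by extending the message list.
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "Great, can you also add type hints?"})

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
follow_up = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(follow_up)
```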
|
|
|
|
|
### Advanced: Streaming Output |
|
|
|
|
|
```python |
|
|
from mlx_lm import load, stream_generate
|
|
|
|
|
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit") |
|
|
|
|
|
# stream_generate yields output incrementally; on recent mlx-lm versions each yielded
# response exposes the newly generated text as .text (older versions yield plain strings)
for response in stream_generate(
    model,
    tokenizer,
    prompt="Tell me about the future of AI:",
    max_tokens=500,
):
    print(response.text, end="", flush=True)
|
|
``` |
|
|
|
|
|
## 🏗️ Architecture Highlights |
|
|
|
|
|
<details> |
|
|
<summary><b>Click to expand technical details</b></summary> |
|
|
|
|
|
### Model Specifications |
|
|
|
|
|
| Feature | Value | |
|
|
|---------|-------| |
|
|
| **Total Parameters** | ~671 Billion | |
|
|
| **Architecture** | DeepSeek V3 (MoE) | |
|
|
| **Experts** | 384 routed + 1 shared | |
|
|
| **Active Experts** | 8 per token | |
|
|
| **Hidden Size** | 7168 | |
|
|
| **Layers** | 61 | |
|
|
| **Heads** | 56 | |
|
|
| **Context Length** | 262,144 tokens | |
|
|
| **Quantization** | 8.501 bits per weight | |
|
|
|
|
|
### Advanced Features |
|
|
|
|
|
- **🎯 YaRN Rope Scaling** - 64x factor for extended context |
|
|
- **🗜️ KV Compression** - LoRA-based (rank 512) |
|
|
- **⚡ Query Compression** - Q-LoRA (rank 1536) |
|
|
- **🧮 MoE Routing** - Top-8 expert selection with sigmoid scoring (sketched below)
|
|
- **🔧 FP8 Training** - Pre-quantized with e4m3 precision |
|
|
|
|
|
</details> |
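For intuition about the routing step, here is a minimal, illustrative sketch of top-k expert selection with sigmoid scoring. It is not the model's actual implementation; the shapes and names are only for illustration.

```python
# Toy top-k MoE routing with sigmoid scoring (illustrative only).
import mlx.core as mx

def route_tokens(hidden, router_weights, k=8):
    # hidden: (tokens, hidden_size), router_weights: (hidden_size, num_experts)
    logits = hidden @ router_weights                      # token-to-expert affinities
    scores = mx.sigmoid(logits)                           # sigmoid scoring (not softmax)
    top_idx = mx.argsort(scores, axis=-1)[:, -k:]         # indices of the k best experts
    top_scores = mx.take_along_axis(scores, top_idx, axis=-1)
    gates = top_scores / top_scores.sum(axis=-1, keepdims=True)  # normalized gate weights
    return top_idx, gates

hidden = mx.random.normal((4, 7168))    # 4 tokens, hidden size 7168
router = mx.random.normal((7168, 384))  # 384 routed experts
experts, gates = route_tokens(hidden, router)
print(experts.shape, gates.shape)       # (4, 8) (4, 8)
```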
|
|
|
|
|
## 🎨 Other Quantization Options |
|
|
|
|
|
Choose the right balance for your needs: |
|
|
|
|
|
| Quantization | Size | Quality | Speed | Best For | |
|
|
|--------------|------|---------|-------|----------| |
|
|
| **8-bit** (you are here) | ~1 TB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Production, best quality | |
|
|
| [6-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-6bit) | ~800 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Sweet spot for most users | |
|
|
| [4-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-4bit) | ~570 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Faster inference | |
|
|
| [2-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-2bit) | ~320 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | Experimental | |
|
|
| [Original](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905) | ~5 TB | ⭐⭐⭐⭐⭐ | ⭐⭐ | Research only | |
|
|
|
|
|
## 🔧 How It Was Made |
|
|
|
|
|
This model was quantized using MLX's built-in quantization: |
|
|
|
|
|
```bash |
|
|
mlx_lm.convert \ |
|
|
--hf-path moonshotai/Kimi-K2-Instruct-0905 \ |
|
|
--mlx-path Kimi-K2-Instruct-0905-MLX-8bit \ |
|
|
-q --q-bits 8 \ |
|
|
--trust-remote-code |
|
|
``` |
|
|
|
|
|
**Result:** 8.501 bits per weight (includes metadata overhead) |
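If you would rather quantize from Python, mlx-lm exposes the same conversion as a function; exact keyword names can differ slightly between mlx-lm releases, so treat this as a sketch:

```python
# Roughly equivalent Python-API conversion (keyword names may vary by mlx-lm version).
from mlx_lm import convert

convert(
    "moonshotai/Kimi-K2-Instruct-0905",        # Hugging Face source repo
    mlx_path="Kimi-K2-Instruct-0905-MLX-8bit", # local output directory
    quantize=True,
    q_bits=8,
)
```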
|
|
|
|
|
## ⚡ Performance Tips |
|
|
|
|
|
<details> |
|
|
<summary><b>Getting the best performance</b></summary> |
|
|
|
|
|
1. **Close other applications** - Free up as much RAM as possible |
|
|
2. **Use an external SSD** - If your internal drive is full |
|
|
3. **Monitor memory** - Watch Activity Monitor during inference |
|
|
4. **Adjust batch size** - If you get OOM errors, reduce max_tokens |
|
|
5. **Keep your Mac cool** - Good airflow helps maintain peak performance |
|
|
|
|
|
</details> |
|
|
|
|
|
## ⚠️ Known Limitations |
|
|
|
|
|
- 🍎 **Apple Silicon Only** - Won't work on Intel Macs or NVIDIA GPUs |
|
|
- 💾 **Huge Storage Needs** - Make sure you have 1.1 TB+ free |
|
|
- 🐏 **RAM Intensive** - Needs 64+ GB unified memory minimum |
|
|
- 🐌 **Slower on M1** - Best performance on M2 Ultra or newer |
|
|
- 🌐 **Bilingual Focus** - Optimized for English and Chinese |
|
|
|
|
|
## 📄 License |
|
|
|
|
|
Apache 2.0 - Same as the original model. Free for commercial use! |
|
|
|
|
|
## 🙏 Acknowledgments |
|
|
|
|
|
- **Original Model:** [Moonshot AI](https://www.moonshot.cn/) for creating Kimi K2 |
|
|
- **Framework:** Apple's [MLX team](https://github.com/ml-explore/mlx) for the amazing framework |
|
|
- **Inspiration:** DeepSeek V3 architecture |
|
|
|
|
|
## 📚 Citation |
|
|
|
|
|
If you use this model in your research or product, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{kimi-k2-2025, |
|
|
title={Kimi K2: Advancing Long-Context Language Models}, |
|
|
author={Moonshot AI}, |
|
|
year={2025}, |
|
|
url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905} |
|
|
} |
|
|
``` |
|
|
|
|
|
## 🔗 Useful Links |
|
|
|
|
|
- 📦 **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905) |
|
|
- 🛠️ **MLX Framework:** [GitHub](https://github.com/ml-explore/mlx) |
|
|
- 📖 **MLX LM Docs:** [GitHub](https://github.com/ml-explore/mlx-examples/tree/main/llms) |
|
|
- 💬 **Discussions:** [Ask questions here!](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-8bit/discussions) |
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**Quantized with ❤️ by [richardyoung](https://deepneuro.ai/richard)** |
|
|
|
|
|
*If you find this useful, please ⭐ star the repo and share with others!* |
|
|
|
|
|
**Created:** October 2025 | **Format:** MLX 8-bit |
|
|
|
|
|
</div> |
|
|
|