---
license: apache-2.0
base_model: moonshotai/Kimi-K2-Instruct-0905
tags:
- mlx
- quantized
- kimi
- deepseek-v3
- moe
- instruction-following
- 8-bit
- apple-silicon
model_type: kimi_k2
pipeline_tag: text-generation
language:
- en
- zh
library_name: mlx
---
<div align="center">
# Kimi K2 Instruct - MLX 8-bit
### State-of-the-Art 671B MoE Model, Optimized for Apple Silicon
[MLX Framework](https://github.com/ml-explore/mlx)
[Model Repo](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-8bit)
[Base Model](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
**[Original Model](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)** | **[MLX Framework](https://github.com/ml-explore/mlx)** | **[More Quantizations](#-other-quantization-options)**
---
</div>
## What is This?
This is a **high-quality 8-bit quantized version** of Kimi K2 Instruct, optimized to run on **Apple Silicon** (M1/M2/M3/M4) Macs using the MLX framework. Think of it as taking a massive 671-billion parameter AI model and compressing it down to ~1 TB while keeping almost all of its intelligence intact!
### Why You'll Love It
- **Massive Context Window** - Handle up to 262,144 tokens (~200,000 words!)
- **671B Parameters** - One of the most capable open models available
- **Apple Silicon Native** - Fully optimized for M-series chips with Metal acceleration
- **8-bit Precision** - Best quality-to-size ratio for serious work
- **Bilingual** - Fluent in both English and Chinese
- **Instruction-Tuned** - Ready for conversations, coding, analysis, and more
## Quick Start
## Hardware Requirements
Kimi-K2 is a massive 671B parameter MoE model. Choose your quantization based on available unified memory:
| Quantization | Model Size | Min RAM | Quality |
|:------------:|:----------:|:-------:|:--------|
| **2-bit** | ~84 GB | 96 GB | Acceptable - some quality loss |
| **3-bit** | ~126 GB | 128 GB | Good - recommended minimum |
| **4-bit** | ~168 GB | 192 GB | Very Good - best quality/size balance |
| **5-bit** | ~210 GB | 256 GB | Excellent |
| **6-bit** | ~252 GB | 288 GB | Near original |
| **8-bit** | ~336 GB | 384 GB | Original quality |
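If you are not sure which variant fits your machine, compare your unified memory against the Min RAM column above. Below is a minimal sketch; it assumes the third-party `psutil` package is installed, and the thresholds simply mirror the table (they are not part of mlx-lm):

```python
import psutil  # third-party: pip install psutil

# Min RAM thresholds from the table above, highest first (GB -> quantization).
TIERS = [(384, "8-bit"), (288, "6-bit"), (256, "5-bit"),
         (192, "4-bit"), (128, "3-bit"), (96, "2-bit")]

total_gb = psutil.virtual_memory().total / 1024**3
choice = next((q for min_gb, q in TIERS if total_gb >= min_gb), None)
print(f"{total_gb:.0f} GB unified memory -> suggested quantization: {choice or 'none fits'}")
```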
### Recommended Configurations
| Mac Model | Max RAM | Recommended Quantization |
|:----------|:-------:|:-------------------------|
| Mac Studio M2 Ultra | 192 GB | 4-bit |
| Mac Studio M3 Ultra | 512 GB | 8-bit |
| Mac Pro M2 Ultra | 192 GB | 4-bit |
| MacBook Pro M3 Max | 128 GB | 3-bit |
| MacBook Pro M4 Max | 128 GB | 3-bit |
### Performance Notes
- **Inference Speed**: Expect ~5-15 tokens/sec depending on quantization and hardware
- **First Token Latency**: 10-30 seconds for model loading
- **Context Window**: Full 256K context supported (262,144 tokens)
- **Active Parameters**: Only ~37B parameters active per token (MoE architecture)
## Installation
```bash
pip install mlx-lm
```
### Your First Generation (3 lines of code!)
```python
from mlx_lm import load, generate
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")
print(generate(model, tokenizer, prompt="Explain quantum entanglement simply:", max_tokens=200))
```
That's it!
## System Requirements
| Component | Minimum | Recommended |
|-----------|---------|-------------|
| **Mac** | M1 or newer | M2 Ultra / M3 Max / M4 Max+ |
| **Memory** | 64 GB unified | 128 GB+ unified |
| **Storage** | 1.1 TB free | Fast SSD (2+ TB) |
| **macOS** | 12.0+ | Latest version |
> ⚠️ **Note:** This is a HUGE model! Make sure you have enough RAM and storage.
## Usage Examples
### Command Line Interface
```bash
mlx_lm.generate \
--model richardyoung/Kimi-K2-Instruct-0905-MLX-8bit \
--prompt "Write a Python script to analyze CSV files." \
--max-tokens 500
```
### Chat Conversation
```python
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

# Build the prompt with the tokenizer's own chat template rather than
# hand-writing special tokens, so the formatting always matches the model.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant specialized in coding and problem-solving."},
    {"role": "user", "content": "Can you help me optimize this Python code?"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)
```
### Advanced: Streaming Output
```python
from mlx_lm import load, stream_generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

# mlx_lm streams via stream_generate(); generate() has no stream flag.
for response in stream_generate(
    model,
    tokenizer,
    prompt="Tell me about the future of AI:",
    max_tokens=500,
):
    # Each yielded item carries the newly generated text segment.
    print(response.text, end="", flush=True)
```
## Architecture Highlights
<details>
<summary><b>Click to expand technical details</b></summary>
### Model Specifications
| Feature | Value |
|---------|-------|
| **Total Parameters** | ~671 Billion |
| **Architecture** | DeepSeek V3 (MoE) |
| **Experts** | 384 routed + 1 shared |
| **Active Experts** | 8 per token |
| **Hidden Size** | 7168 |
| **Layers** | 61 |
| **Heads** | 56 |
| **Context Length** | 262,144 tokens |
| **Quantization** | 8.501 bits per weight |
### Advanced Features
- **YaRN RoPE Scaling** - 64x factor for extended context
- **KV Compression** - LoRA-based (rank 512)
- **Query Compression** - Q-LoRA (rank 1536)
- **MoE Routing** - Top-8 expert selection with sigmoid scoring (see the sketch below)
- **FP8 Training** - Pre-quantized with e4m3 precision
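To make the routing bullet concrete, here is a minimal sketch of sigmoid-scored top-8 expert selection. It is illustrative only: the function and variable names are made up, and the real Kimi K2 router includes details (per-expert bias, score scaling, expert grouping) that are omitted here.

```python
import mlx.core as mx

def route_tokens(hidden, gate_weights, top_k=8):
    """Pick top_k experts per token using sigmoid affinity scores."""
    # hidden: (num_tokens, hidden_size); gate_weights: (hidden_size, num_experts)
    scores = mx.sigmoid(hidden @ gate_weights)                  # affinities in (0, 1)
    top_idx = mx.argpartition(-scores, kth=top_k - 1, axis=-1)[..., :top_k]
    top_scores = mx.take_along_axis(scores, top_idx, axis=-1)
    weights = top_scores / mx.sum(top_scores, axis=-1, keepdims=True)  # mixing weights
    return top_idx, weights

# Toy usage: 4 tokens, hidden size 7168, 384 routed experts (as in the spec table).
tokens = mx.random.normal((4, 7168))
gate = mx.random.normal((7168, 384)) * 0.02
experts, weights = route_tokens(tokens, gate)
print(experts.shape, weights.shape)  # (4, 8) (4, 8)
```

Because only the selected experts' MLPs run for each token, just a fraction of the total parameters is active per step, which is what keeps inference tractable despite the model's size.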
</details>
## Other Quantization Options
Choose the right balance for your needs:
| Quantization | Size | Quality | Speed | Best For |
|--------------|------|---------|-------|----------|
| **8-bit** (you are here) | ~1 TB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Production, best quality |
| [6-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-6bit) | ~800 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Sweet spot for most users |
| [4-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-4bit) | ~570 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Faster inference |
| [2-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-2bit) | ~320 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | Experimental |
| [Original](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905) | ~1 TB (FP8) | ⭐⭐⭐⭐⭐ | ⭐⭐ | Research only |
## How It Was Made
This model was quantized using MLX's built-in quantization:
```bash
mlx_lm.convert \
--hf-path moonshotai/Kimi-K2-Instruct-0905 \
--mlx-path Kimi-K2-Instruct-0905-MLX-8bit \
-q --q-bits 8 \
--trust-remote-code
```
**Result:** 8.501 bits per weight (includes metadata overhead)
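You can confirm the quantization settings of the converted weights from the `config.json` that `mlx_lm.convert` writes next to them. A quick check (the local path is whatever `--mlx-path` you used; the exact keys may vary slightly between mlx-lm versions):

```python
import json

# Path produced by --mlx-path in the convert command above.
with open("Kimi-K2-Instruct-0905-MLX-8bit/config.json") as f:
    config = json.load(f)

# mlx_lm records its quantization parameters in the config,
# e.g. {"group_size": 64, "bits": 8}.
print(config.get("quantization"))
```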
## Performance Tips
<details>
<summary><b>Getting the best performance</b></summary>
1. **Close other applications** - Free up as much RAM as possible
2. **Use an external SSD** - If your internal drive is full
3. **Monitor memory** - Watch Activity Monitor during inference, or query MLX directly (see the sketch below)
4. **Adjust batch size** - If you get OOM errors, reduce max_tokens
5. **Keep your Mac cool** - Good airflow helps maintain peak performance
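A minimal way to watch memory from inside a script, assuming a recent MLX build (older releases expose the same helpers under `mx.metal` instead of the top-level namespace):

```python
import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")
text = generate(model, tokenizer, prompt="Hello!", max_tokens=50)

# Report how much unified memory MLX is holding after the run.
gb = 1024**3
print(f"active: {mx.get_active_memory() / gb:.1f} GB, "
      f"peak: {mx.get_peak_memory() / gb:.1f} GB")
```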
</details>
## ⚠️ Known Limitations
- **Apple Silicon Only** - Won't work on Intel Macs or NVIDIA GPUs
- **Huge Storage Needs** - Make sure you have 1.1 TB+ free
- **RAM Intensive** - Needs 64+ GB unified memory minimum
- **Slower on M1** - Best performance on M2 Ultra or newer
- **Bilingual Focus** - Optimized for English and Chinese
## License
Apache 2.0 - Same as the original model. Free for commercial use!
## Acknowledgments
- **Original Model:** [Moonshot AI](https://www.moonshot.cn/) for creating Kimi K2
- **Framework:** Apple's [MLX team](https://github.com/ml-explore/mlx) for the amazing framework
- **Inspiration:** DeepSeek V3 architecture
## Citation
If you use this model in your research or product, please cite:
```bibtex
@misc{kimi-k2-2025,
title={Kimi K2: Advancing Long-Context Language Models},
author={Moonshot AI},
year={2025},
url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}
}
```
## Useful Links
- **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
- **MLX Framework:** [GitHub](https://github.com/ml-explore/mlx)
- **MLX LM Docs:** [GitHub](https://github.com/ml-explore/mlx-examples/tree/main/llms)
- **Discussions:** [Ask questions here!](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-8bit/discussions)
---
<div align="center">
**Quantized with ❤️ by [richardyoung](https://deepneuro.ai/richard)**
*If you find this useful, please ⭐ star the repo and share with others!*
**Created:** October 2025 | **Format:** MLX 8-bit
</div>