---
license: apache-2.0
base_model: moonshotai/Kimi-K2-Instruct-0905
tags:
- mlx
- quantized
- kimi
- deepseek-v3
- moe
- instruction-following
- 8-bit
- apple-silicon
model_type: kimi_k2
pipeline_tag: text-generation
language:
- en
- zh
library_name: mlx
---
<div align="center">
# Kimi K2 Instruct - MLX 8-bit
### State-of-the-Art 671B MoE Model, Optimized for Apple Silicon
[MLX Framework](https://github.com/ml-explore/mlx)
[Model Repo](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-8bit)
[Base Model](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
**[Original Model](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)** | **[MLX Framework](https://github.com/ml-explore/mlx)** | **[More Quantizations](#-other-quantization-options)**
---
</div>
## What is This?
This is a **high-quality 8-bit quantized version** of Kimi K2 Instruct, optimized to run on **Apple Silicon** (M1/M2/M3/M4) Macs using the MLX framework. Think of it as taking a massive 671-billion parameter AI model and compressing it down to ~1 TB while keeping almost all of its intelligence intact!
### Why You'll Love It
- **Massive Context Window** - Handle up to 262,144 tokens (~200,000 words!)
- **671B Parameters** - One of the most capable open models available
- **Apple Silicon Native** - Fully optimized for M-series chips with Metal acceleration
- **8-bit Precision** - Best quality-to-size ratio for serious work
- **Bilingual** - Fluent in both English and Chinese
- **Instruction-Tuned** - Ready for conversations, coding, analysis, and more
## Quick Start
## Hardware Requirements
Kimi-K2 is a massive 671B parameter MoE model. Choose your quantization based on available unified memory:
| Quantization | Model Size | Min RAM | Quality |
|:------------:|:----------:|:-------:|:--------|
| **2-bit** | ~84 GB | 96 GB | Acceptable - some quality loss |
| **3-bit** | ~126 GB | 128 GB | Good - recommended minimum |
| **4-bit** | ~168 GB | 192 GB | Very Good - best quality/size balance |
| **5-bit** | ~210 GB | 256 GB | Excellent |
| **6-bit** | ~252 GB | 288 GB | Near original |
| **8-bit** | ~336 GB | 384 GB | Original quality |
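If you are not sure which variant fits your machine, compare your unified memory against the Min RAM column above. Below is a minimal sketch; it assumes the third-party `psutil` package is installed, and the thresholds simply mirror the table (they are not part of mlx-lm):

```python
import psutil  # third-party: pip install psutil

# Min RAM thresholds from the table above, highest first (GB -> quantization).
TIERS = [(384, "8-bit"), (288, "6-bit"), (256, "5-bit"),
         (192, "4-bit"), (128, "3-bit"), (96, "2-bit")]

total_gb = psutil.virtual_memory().total / 1024**3
choice = next((q for min_gb, q in TIERS if total_gb >= min_gb), None)
print(f"{total_gb:.0f} GB unified memory -> suggested quantization: {choice or 'none fits'}")
```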
### Recommended Configurations
| Mac Model | Max RAM | Recommended Quantization |
|:----------|:-------:|:-------------------------|
| Mac Studio M2 Ultra | 192 GB | 4-bit |
| Mac Studio M3 Ultra | 512 GB | 8-bit |
| Mac Pro M2 Ultra | 192 GB | 4-bit |
| MacBook Pro M3 Max | 128 GB | 3-bit |
| MacBook Pro M4 Max | 128 GB | 3-bit |
### Performance Notes
- **Inference Speed**: Expect ~5-15 tokens/sec depending on quantization and hardware
- **First Token Latency**: 10-30 seconds for model loading
- **Context Window**: Full 256K context supported (262,144 tokens)
- **Active Parameters**: Only ~37B parameters active per token (MoE architecture)
## Installation
```bash
pip install mlx-lm
```
### Your First Generation (3 lines of code!)
```python
from mlx_lm import load, generate
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")
print(generate(model, tokenizer, prompt="Explain quantum entanglement simply:", max_tokens=200))
```
That's it!
## System Requirements
| Component | Minimum | Recommended |
|-----------|---------|-------------|
| **Mac** | M1 or newer | M2 Ultra / M3 Max / M4 Max+ |
| **Memory** | 64 GB unified | 128 GB+ unified |
| **Storage** | 1.1 TB free | Fast SSD (2+ TB) |
| **macOS** | 12.0+ | Latest version |
> ⚠️ **Note:** This is a HUGE model! Make sure you have enough RAM and storage.
## Usage Examples
### Command Line Interface
```bash
mlx_lm.generate \
--model richardyoung/Kimi-K2-Instruct-0905-MLX-8bit \
--prompt "Write a Python script to analyze CSV files." \
--max-tokens 500
```
### Chat Conversation
```python
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

# Build the prompt with the tokenizer's own chat template rather than
# hand-writing special tokens, so the formatting always matches the model.
messages = [
    {"role": "system", "content": "You are a helpful AI assistant specialized in coding and problem-solving."},
    {"role": "user", "content": "Can you help me optimize this Python code?"},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(response)
```
### Advanced: Streaming Output
```python
from mlx_lm import load, stream_generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

# mlx_lm streams via stream_generate(); generate() has no stream flag.
for response in stream_generate(
    model,
    tokenizer,
    prompt="Tell me about the future of AI:",
    max_tokens=500,
):
    # Each yielded item carries the newly generated text segment.
    print(response.text, end="", flush=True)
```
## Architecture Highlights
<details>
<summary><b>Click to expand technical details</b></summary>
### Model Specifications
| Feature | Value |
|---------|-------|
| **Total Parameters** | ~671 Billion |
| **Architecture** | DeepSeek V3 (MoE) |
| **Experts** | 384 routed + 1 shared |
| **Active Experts** | 8 per token |
| **Hidden Size** | 7168 |
| **Layers** | 61 |
| **Heads** | 56 |
| **Context Length** | 262,144 tokens |
| **Quantization** | 8.501 bits per weight |
### Advanced Features
- **YaRN RoPE Scaling** - 64x factor for extended context
- **KV Compression** - LoRA-based (rank 512)
- **Query Compression** - Q-LoRA (rank 1536)
- **MoE Routing** - Top-8 expert selection with sigmoid scoring (see the sketch below)
- **FP8 Training** - Pre-quantized with e4m3 precision
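To make the routing bullet concrete, here is a minimal sketch of sigmoid-scored top-8 expert selection. It is illustrative only: the function and variable names are made up, and the real Kimi K2 router includes details (per-expert bias, score scaling, expert grouping) that are omitted here.

```python
import mlx.core as mx

def route_tokens(hidden, gate_weights, top_k=8):
    """Pick top_k experts per token using sigmoid affinity scores."""
    # hidden: (num_tokens, hidden_size); gate_weights: (hidden_size, num_experts)
    scores = mx.sigmoid(hidden @ gate_weights)                  # affinities in (0, 1)
    top_idx = mx.argpartition(-scores, kth=top_k - 1, axis=-1)[..., :top_k]
    top_scores = mx.take_along_axis(scores, top_idx, axis=-1)
    weights = top_scores / mx.sum(top_scores, axis=-1, keepdims=True)  # mixing weights
    return top_idx, weights

# Toy usage: 4 tokens, hidden size 7168, 384 routed experts (as in the spec table).
tokens = mx.random.normal((4, 7168))
gate = mx.random.normal((7168, 384)) * 0.02
experts, weights = route_tokens(tokens, gate)
print(experts.shape, weights.shape)  # (4, 8) (4, 8)
```

Because only the selected experts' MLPs run for each token, just a fraction of the total parameters is active per step, which is what keeps inference tractable despite the model's size.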
</details>
## Other Quantization Options
Choose the right balance for your needs:
| Quantization | Size | Quality | Speed | Best For |
|--------------|------|---------|-------|----------|
| **8-bit** (you are here) | ~1 TB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Production, best quality |
| [6-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-6bit) | ~800 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Sweet spot for most users |
| [4-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-4bit) | ~570 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Faster inference |
| [2-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-2bit) | ~320 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | Experimental |
| [Original](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905) | ~1 TB (FP8) | ⭐⭐⭐⭐⭐ | ⭐⭐ | Research only |
## How It Was Made
This model was quantized using MLX's built-in quantization:
```bash
mlx_lm.convert \
--hf-path moonshotai/Kimi-K2-Instruct-0905 \
--mlx-path Kimi-K2-Instruct-0905-MLX-8bit \
-q --q-bits 8 \
--trust-remote-code
```
**Result:** 8.501 bits per weight (includes metadata overhead)
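You can confirm the quantization settings of the converted weights from the `config.json` that `mlx_lm.convert` writes next to them. A quick check (the local path is whatever `--mlx-path` you used; the exact keys may vary slightly between mlx-lm versions):

```python
import json

# Path produced by --mlx-path in the convert command above.
with open("Kimi-K2-Instruct-0905-MLX-8bit/config.json") as f:
    config = json.load(f)

# mlx_lm records its quantization parameters in the config,
# e.g. {"group_size": 64, "bits": 8}.
print(config.get("quantization"))
```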
## Performance Tips
<details>
<summary><b>Getting the best performance</b></summary>
1. **Close other applications** - Free up as much RAM as possible
2. **Use an external SSD** - If your internal drive is full
3. **Monitor memory** - Watch Activity Monitor during inference, or query MLX directly (see the sketch below)
4. **Adjust batch size** - If you get OOM errors, reduce max_tokens
5. **Keep your Mac cool** - Good airflow helps maintain peak performance
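A minimal way to watch memory from inside a script, assuming a recent MLX build (older releases expose the same helpers under `mx.metal` instead of the top-level namespace):

```python
import mlx.core as mx
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")
text = generate(model, tokenizer, prompt="Hello!", max_tokens=50)

# Report how much unified memory MLX is holding after the run.
gb = 1024**3
print(f"active: {mx.get_active_memory() / gb:.1f} GB, "
      f"peak: {mx.get_peak_memory() / gb:.1f} GB")
```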
</details>
## ⚠️ Known Limitations
- **Apple Silicon Only** - Won't work on Intel Macs or NVIDIA GPUs
- **Huge Storage Needs** - Make sure you have 1.1 TB+ free
- **RAM Intensive** - Needs 64+ GB unified memory minimum
- **Slower on M1** - Best performance on M2 Ultra or newer
- **Bilingual Focus** - Optimized for English and Chinese
## License
Apache 2.0 - Same as the original model. Free for commercial use!
## Acknowledgments
- **Original Model:** [Moonshot AI](https://www.moonshot.cn/) for creating Kimi K2
- **Framework:** Apple's [MLX team](https://github.com/ml-explore/mlx) for the amazing framework
- **Inspiration:** DeepSeek V3 architecture
## Citation
If you use this model in your research or product, please cite:
```bibtex
@misc{kimi-k2-2025,
title={Kimi K2: Advancing Long-Context Language Models},
author={Moonshot AI},
year={2025},
url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905}
}
```
## Useful Links
- **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
- **MLX Framework:** [GitHub](https://github.com/ml-explore/mlx)
- **MLX LM Docs:** [GitHub](https://github.com/ml-explore/mlx-examples/tree/main/llms)
- **Discussions:** [Ask questions here!](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-8bit/discussions)
---
<div align="center">
**Quantized with ❤️ by [richardyoung](https://deepneuro.ai/richard)**
*If you find this useful, please ⭐ star the repo and share with others!*
**Created:** October 2025 | **Format:** MLX 8-bit
</div>