|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: moonshotai/Kimi-K2-Instruct-0905 |
|
|
tags: |
|
|
- mlx |
|
|
- quantized |
|
|
- kimi |
|
|
- deepseek-v3 |
|
|
- moe |
|
|
- instruction-following |
|
|
- 8-bit |
|
|
- apple-silicon |
|
|
model_type: kimi_k2 |
|
|
pipeline_tag: text-generation |
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
library_name: mlx |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
# 🌙 Kimi K2 Instruct - MLX 8-bit |
|
|
|
|
|
### State-of-the-Art 671B MoE Model, Optimized for Apple Silicon |
|
|
|
|
|
|
|
|
|
|
**[Original Model](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)** | **[MLX Framework](https://github.com/ml-explore/mlx)** | **[More Quantizations](#-other-quantization-options)** |
|
|
|
|
|
--- |
|
|
|
|
|
</div> |
|
|
|
|
|
## 📖 What is This? |
|
|
|
|
|
This is a **high-quality 8-bit quantized version** of Kimi K2 Instruct, optimized to run on **Apple Silicon** (M1/M2/M3/M4) Macs using the MLX framework. Think of it as taking a massive 671-billion parameter AI model and compressing it down to ~1 TB while keeping almost all of its intelligence intact! |
|
|
|
|
|
### ✨ Why You'll Love It |
|
|
|
|
|
- 🚀 **Massive Context Window** - Handle up to 262,144 tokens (~200,000 words!) |
|
|
- 🧠 **671B Parameters** - One of the most capable open models available |
|
|
- ⚡ **Apple Silicon Native** - Fully optimized for M-series chips with Metal acceleration |
|
|
- 🎯 **8-bit Precision** - Near-original quality for serious work
|
|
- 🌏 **Bilingual** - Fluent in both English and Chinese |
|
|
- 💬 **Instruction-Tuned** - Ready for conversations, coding, analysis, and more |
|
|
|
|
|
## 🎯 Quick Start |
|
|
|
|
|
|
|
|
|
|
## Hardware Requirements |
|
|
|
|
|
Kimi-K2 is a massive 671B parameter MoE model. Choose your quantization based on available unified memory: |
|
|
|
|
|
| Quantization | Model Size | Min RAM | Quality | |
|
|
|:------------:|:----------:|:-------:|:--------| |
|
|
| **2-bit** | ~84 GB | 96 GB | Acceptable - some quality loss | |
|
|
| **3-bit** | ~126 GB | 128 GB | Good - recommended minimum | |
|
|
| **4-bit** | ~168 GB | 192 GB | Very Good - best quality/size balance | |
|
|
| **5-bit** | ~210 GB | 256 GB | Excellent | |
|
|
| **6-bit** | ~252 GB | 288 GB | Near original | |
|
|
| **8-bit** | ~336 GB | 384 GB | Original quality | |
|
|
|
|
|
### Recommended Configurations |
|
|
|
|
|
| Mac Model | Max RAM | Recommended Quantization | |
|
|
|:----------|:-------:|:-------------------------| |
|
|
| Mac Studio M2 Ultra | 192 GB | 4-bit | |
|
|
| Mac Studio M3 Ultra | 512 GB | 8-bit |
|
|
| Mac Pro M2 Ultra | 192 GB | 4-bit | |
|
|
| MacBook Pro M3 Max | 128 GB | 3-bit | |
|
|
| MacBook Pro M4 Max | 128 GB | 3-bit | |
|
|
|
|
|
### Performance Notes |
|
|
|
|
|
- **Inference Speed**: Expect roughly 5-15 tokens/sec depending on quantization and hardware (see the timing sketch after this list)
|
|
- **First Token Latency**: 10-30 seconds for model loading |
|
|
- **Context Window**: Full 262,144-token (256K) context supported
|
|
- **Active Parameters**: Only ~37B parameters active per token (MoE architecture) |
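Throughput varies a lot by machine, so it is worth measuring on your own hardware. A minimal timing sketch, assuming the model from the Quick Start example is already downloaded:

```python
# Rough tokens/sec measurement; results depend heavily on quantization and hardware.
import time
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

start = time.perf_counter()
text = generate(model, tokenizer, prompt="Briefly explain MoE routing:", max_tokens=200)
elapsed = time.perf_counter() - start

# Includes prompt processing time, so treat the number as a rough estimate.
n_tokens = len(tokenizer.encode(text))
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tokens/sec")
```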
|
|
|
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
pip install mlx-lm |
|
|
``` |
|
|
|
|
|
### Your First Generation (3 lines of code!) |
|
|
|
|
|
```python |
|
|
from mlx_lm import load, generate |
|
|
|
|
|
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit") |
|
|
print(generate(model, tokenizer, prompt="Explain quantum entanglement simply:", max_tokens=200)) |
|
|
``` |
|
|
|
|
|
That's it! 🎉 |
|
|
|
|
|
## 💻 System Requirements |
|
|
|
|
|
| Component | Minimum | Recommended | |
|
|
|-----------|---------|-------------| |
|
|
| **Mac** | M1 or newer | M2 Ultra / M3 Max / M4 Max+ | |
|
|
| **Memory** | 64 GB unified | 128 GB+ unified | |
|
|
| **Storage** | 1.1 TB free | Fast SSD (2+ TB) | |
|
|
| **macOS** | 12.0+ | Latest version | |
|
|
|
|
|
> ⚠️ **Note:** This is a HUGE model! Make sure you have enough RAM and storage. |
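If you want to verify before starting the download, here is a small standard-library check (macOS only; the ~1.1 TB figure comes from the table above):

```python
# Pre-flight check for the 8-bit weights on macOS (standard library only).
import shutil
import subprocess

free_disk_gb = shutil.disk_usage("/").free / 1e9
unified_memory_gb = int(subprocess.check_output(["sysctl", "-n", "hw.memsize"])) / 1e9

print(f"Free disk:      {free_disk_gb:.0f} GB (the 8-bit weights need ~1100 GB)")
print(f"Unified memory: {unified_memory_gb:.0f} GB")
```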
|
|
|
|
|
## 📚 Usage Examples |
|
|
|
|
|
### Command Line Interface |
|
|
|
|
|
```bash |
|
|
mlx_lm.generate \ |
|
|
--model richardyoung/Kimi-K2-Instruct-0905-MLX-8bit \ |
|
|
--prompt "Write a Python script to analyze CSV files." \ |
|
|
--max-tokens 500 |
|
|
``` |
|
|
|
|
|
### Chat Conversation |
|
|
|
|
|
```python |
|
|
from mlx_lm import load, generate |
|
|
|
|
|
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit") |
|
|
|
|
|
conversation = """<|im_start|>system |
|
|
You are a helpful AI assistant specialized in coding and problem-solving.<|im_end|> |
|
|
<|im_start|>user |
|
|
Can you help me optimize this Python code?<|im_end|> |
|
|
<|im_start|>assistant |
|
|
""" |
|
|
|
|
|
response = generate(model, tokenizer, prompt=conversation, max_tokens=500) |
|
|
print(response) |
|
|
``` |
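To keep the conversation going, append each reply and the next user turn to `messages`, then rebuild the prompt. A minimal continuation of the example above:

```python
# Carry the conversation forward by extending the message list.
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "Great, can you also add type hints?"})

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
follow_up = generate(model, tokenizer, prompt=prompt, max_tokens=500)
print(follow_up)
```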
|
|
|
|
|
### Advanced: Streaming Output |
|
|
|
|
|
```python |
|
|
from mlx_lm import load, stream_generate
|
|
|
|
|
model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit") |
|
|
|
|
|
# stream_generate yields output incrementally; on recent mlx-lm versions each yielded
# response exposes the newly generated text as .text (older versions yield plain strings)
for response in stream_generate(
    model,
    tokenizer,
    prompt="Tell me about the future of AI:",
    max_tokens=500,
):
    print(response.text, end="", flush=True)
|
|
``` |
|
|
|
|
|
## 🏗️ Architecture Highlights |
|
|
|
|
|
<details> |
|
|
<summary><b>Click to expand technical details</b></summary> |
|
|
|
|
|
### Model Specifications |
|
|
|
|
|
| Feature | Value | |
|
|
|---------|-------| |
|
|
| **Total Parameters** | ~671 Billion | |
|
|
| **Architecture** | DeepSeek V3 (MoE) | |
|
|
| **Experts** | 384 routed + 1 shared | |
|
|
| **Active Experts** | 8 per token | |
|
|
| **Hidden Size** | 7168 | |
|
|
| **Layers** | 61 | |
|
|
| **Heads** | 56 | |
|
|
| **Context Length** | 262,144 tokens | |
|
|
| **Quantization** | 8.501 bits per weight | |
|
|
|
|
|
### Advanced Features |
|
|
|
|
|
- **🎯 YaRN Rope Scaling** - 64x factor for extended context |
|
|
- **🗜️ KV Compression** - LoRA-based (rank 512) |
|
|
- **⚡ Query Compression** - Q-LoRA (rank 1536) |
|
|
- **🧮 MoE Routing** - Top-8 expert selection with sigmoid scoring (sketched below)
|
|
- **🔧 FP8 Training** - Pre-quantized with e4m3 precision |
|
|
|
|
|
</details> |
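For intuition about the routing step, here is a minimal, illustrative sketch of top-k expert selection with sigmoid scoring. It is not the model's actual implementation; the shapes and names are only for illustration.

```python
# Toy top-k MoE routing with sigmoid scoring (illustrative only).
import mlx.core as mx

def route_tokens(hidden, router_weights, k=8):
    # hidden: (tokens, hidden_size), router_weights: (hidden_size, num_experts)
    logits = hidden @ router_weights                      # token-to-expert affinities
    scores = mx.sigmoid(logits)                           # sigmoid scoring (not softmax)
    top_idx = mx.argsort(scores, axis=-1)[:, -k:]         # indices of the k best experts
    top_scores = mx.take_along_axis(scores, top_idx, axis=-1)
    gates = top_scores / top_scores.sum(axis=-1, keepdims=True)  # normalized gate weights
    return top_idx, gates

hidden = mx.random.normal((4, 7168))    # 4 tokens, hidden size 7168
router = mx.random.normal((7168, 384))  # 384 routed experts
experts, gates = route_tokens(hidden, router)
print(experts.shape, gates.shape)       # (4, 8) (4, 8)
```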
|
|
|
|
|
## 🎨 Other Quantization Options |
|
|
|
|
|
Choose the right balance for your needs: |
|
|
|
|
|
| Quantization | Size | Quality | Speed | Best For | |
|
|
|--------------|------|---------|-------|----------| |
|
|
| **8-bit** (you are here) | ~1 TB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Production, best quality | |
|
|
| [6-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-6bit) | ~800 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Sweet spot for most users | |
|
|
| [4-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-4bit) | ~570 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Faster inference | |
|
|
| [2-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-2bit) | ~320 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | Experimental | |
|
|
| [Original](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905) | ~5 TB | ⭐⭐⭐⭐⭐ | ⭐⭐ | Research only | |
|
|
|
|
|
## 🔧 How It Was Made |
|
|
|
|
|
This model was quantized using MLX's built-in quantization: |
|
|
|
|
|
```bash |
|
|
mlx_lm.convert \ |
|
|
--hf-path moonshotai/Kimi-K2-Instruct-0905 \ |
|
|
--mlx-path Kimi-K2-Instruct-0905-MLX-8bit \ |
|
|
-q --q-bits 8 \ |
|
|
--trust-remote-code |
|
|
``` |
|
|
|
|
|
**Result:** 8.501 bits per weight (includes metadata overhead) |
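If you would rather quantize from Python, mlx-lm exposes the same conversion as a function; exact keyword names can differ slightly between mlx-lm releases, so treat this as a sketch:

```python
# Roughly equivalent Python-API conversion (keyword names may vary by mlx-lm version).
from mlx_lm import convert

convert(
    "moonshotai/Kimi-K2-Instruct-0905",        # Hugging Face source repo
    mlx_path="Kimi-K2-Instruct-0905-MLX-8bit", # local output directory
    quantize=True,
    q_bits=8,
)
```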
|
|
|
|
|
## ⚡ Performance Tips |
|
|
|
|
|
<details> |
|
|
<summary><b>Getting the best performance</b></summary> |
|
|
|
|
|
1. **Close other applications** - Free up as much RAM as possible |
|
|
2. **Use an external SSD** - If your internal drive is full |
|
|
3. **Monitor memory** - Watch Activity Monitor during inference |
|
|
4. **Adjust batch size** - If you get OOM errors, reduce max_tokens |
|
|
5. **Keep your Mac cool** - Good airflow helps maintain peak performance |
|
|
|
|
|
</details> |
|
|
|
|
|
## ⚠️ Known Limitations |
|
|
|
|
|
- 🍎 **Apple Silicon Only** - Won't work on Intel Macs or NVIDIA GPUs |
|
|
- 💾 **Huge Storage Needs** - Make sure you have 1.1 TB+ free |
|
|
- 🐏 **RAM Intensive** - Needs 64+ GB unified memory minimum |
|
|
- 🐌 **Slower on M1** - Best performance on M2 Ultra or newer |
|
|
- 🌐 **Bilingual Focus** - Optimized for English and Chinese |
|
|
|
|
|
## 📄 License |
|
|
|
|
|
Apache 2.0 - Same as the original model. Free for commercial use! |
|
|
|
|
|
## 🙏 Acknowledgments |
|
|
|
|
|
- **Original Model:** [Moonshot AI](https://www.moonshot.cn/) for creating Kimi K2 |
|
|
- **Framework:** Apple's [MLX team](https://github.com/ml-explore/mlx) for the amazing framework |
|
|
- **Inspiration:** DeepSeek V3 architecture |
|
|
|
|
|
## 📚 Citation |
|
|
|
|
|
If you use this model in your research or product, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{kimi-k2-2025, |
|
|
title={Kimi K2: Advancing Long-Context Language Models}, |
|
|
author={Moonshot AI}, |
|
|
year={2025}, |
|
|
url={https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905} |
|
|
} |
|
|
``` |
|
|
|
|
|
## 🔗 Useful Links |
|
|
|
|
|
- 📦 **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905) |
|
|
- 🛠️ **MLX Framework:** [GitHub](https://github.com/ml-explore/mlx) |
|
|
- 📖 **MLX LM Docs:** [GitHub](https://github.com/ml-explore/mlx-examples/tree/main/llms) |
|
|
- 💬 **Discussions:** [Ask questions here!](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-8bit/discussions) |
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
**Quantized with ❤️ by [richardyoung](https://deepneuro.ai/richard)** |
|
|
|
|
|
*If you find this useful, please ⭐ star the repo and share with others!* |
|
|
|
|
|
**Created:** October 2025 | **Format:** MLX 8-bit |
|
|
|
|
|
</div> |
|
|
|