richardyoung committed
Commit 68f9b5c · verified · 1 Parent(s): 71034ca

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +151 -98

README.md CHANGED
@@ -9,82 +9,88 @@ tags:
  - moe
  - instruction-following
  - 8-bit
  model_type: kimi_k2
  pipeline_tag: text-generation
  ---

- # Kimi-K2-Instruct-0905 MLX 8-bit
-
- MLX 8-bit quantized version of [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905), a state-of-the-art instruction-following language model based on DeepSeek V3 architecture.
-
- ## Model Details
-
- **Architecture:** DeepSeek V3 (Kimi K2)
- - **Parameters:** ~671B total (Mixture of Experts)
- - 384 routed experts
- - 8 experts per token
- - 1 shared expert
- - **Hidden Size:** 7168
- - **Layers:** 61
- - **Context Length:** 262,144 tokens
- - **Quantization:** MLX 8-bit (8.501 bits per weight)
- - **Size:** 1.0 TB
- - **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
-
- ## Features
-
- - Long context support (262K tokens)
- - Advanced Mixture of Experts (MoE) architecture with 384 experts
- - Optimized for Apple Silicon with MLX framework
- - High-quality 8-bit quantization maintains excellent performance
- - Instruction-following and multi-turn conversation capabilities
- - Native Metal acceleration on M1/M2/M3/M4 Macs
-
- ## Installation

  ```bash
  pip install mlx-lm
  ```

- ## Usage
-
- ### Python API

  ```python
  from mlx_lm import load, generate

- # Load the model
  model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")
-
- # Generate text
- prompt = "Explain quantum computing in simple terms."
- response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
- print(response)
  ```

- ### Command Line

  ```bash
  mlx_lm.generate \
  --model richardyoung/Kimi-K2-Instruct-0905-MLX-8bit \
- --prompt "Write a Python function to calculate Fibonacci numbers." \
  --max-tokens 500
  ```

- ### Chat Format
-
- The model uses the ChatML format:
-
- ```
- <|im_start|>system
- You are a helpful assistant.<|im_end|>
- <|im_start|>user
- {user message}<|im_end|>
- <|im_start|>assistant
- {assistant response}<|im_end|>
- ```
-
- ### Multi-turn Conversation Example

  ```python
  from mlx_lm import load, generate
@@ -92,55 +98,75 @@ from mlx_lm import load, generate
  model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

  conversation = """<|im_start|>system
- You are a helpful coding assistant.<|im_end|>
  <|im_start|>user
- Write a Python function to reverse a string.<|im_end|>
  <|im_start|>assistant
  """

- response = generate(model, tokenizer, prompt=conversation, max_tokens=300)
  print(response)
  ```

- ## System Requirements
-
- **Minimum:**
- - 1.1 TB free disk space
- - 64 GB RAM (unified memory)
- - Apple Silicon Mac (M1 or later)
- - macOS 12.0 or later
-
- **Recommended:**
- - 128 GB+ unified memory
- - M2 Ultra, M3 Max, or M4 Max/Ultra
- - Fast SSD storage
-
- ## Performance Notes
-
- - **Memory Usage:** ~1 TB model size + ~20-40 GB runtime overhead
- - **Inference Speed:** Depends on hardware (faster on M2 Ultra/M3 Max)
- - **Quantization:** 8-bit quantization maintains near-original model quality
- - **MoE Efficiency:** Only 8 experts activated per token (not all 384)
-
- ## Model Variants
-
- If you need different quantization levels or formats:
-
- - **MLX 6-bit** (coming soon): `richardyoung/Kimi-K2-Instruct-0905-MLX-6bit`
- - **MLX 4-bit** (coming soon): `richardyoung/Kimi-K2-Instruct-0905-MLX-4bit`
- - **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
-
- ## Limitations
-
- - Requires Apple Silicon (not compatible with x86/CUDA)
- - Very large model size (1 TB) requires significant storage
- - High memory requirements (64+ GB unified memory)
- - Inference speed depends heavily on available RAM and SSD speed
- - Chinese-English bilingual model, optimized for both languages
-
- ## Technical Details
-
- ### Quantization Method

  This model was quantized using MLX's built-in quantization:

@@ -148,22 +174,46 @@ This model was quantized using MLX's built-in quantization:
  mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path Kimi-K2-Instruct-0905-MLX-8bit \
- -q --q-bits 8 --trust-remote-code
  ```

- **Result:** 8.501 bits per weight (slightly higher than 8-bit due to metadata)
-
- ### Architecture Highlights
-
- - **Rope Scaling:** YaRN with 64x factor for extended context
- - **KV Compression:** LoRA-based key-value compression (rank 512)
- - **Query Compression:** Q-LoRA rank 1536
- - **MoE Routing:** Top-8 expert selection with sigmoid scoring
- - **Training:** Pre-quantized with FP8 (e4m3) in base model
-
- ## Citation
-
- If you use this model, please cite the original Kimi K2 work:

  ```bibtex
  @misc{kimi-k2-2025,
@@ -174,18 +224,21 @@
  }
  ```

- ## License
-
- Same as base model: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
-
- ## Links
-
- - **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
- - **MLX Framework:** [GitHub](https://github.com/ml-explore/mlx)
- - **MLX LM:** [GitHub](https://github.com/ml-explore/mlx-examples/tree/main/llms)
-
- ---
-
- **Quantized by:** richardyoung
- **Format:** MLX 8-bit
- **Created:** 2025-10-25
 
  - moe
  - instruction-following
  - 8-bit
+ - apple-silicon
  model_type: kimi_k2
  pipeline_tag: text-generation
+ language:
+ - en
+ - zh
+ library_name: mlx
  ---

+ <div align="center">
+
+ # 🌙 Kimi K2 Instruct - MLX 8-bit
+
+ ### State-of-the-Art 1-Trillion-Parameter MoE Model, Optimized for Apple Silicon
+
+ [![MLX](https://img.shields.io/badge/MLX-Optimized-blue?logo=apple)](https://github.com/ml-explore/mlx)
+ [![Model Size](https://img.shields.io/badge/Size-1.0_TB-green)](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-8bit)
+ [![Quantization](https://img.shields.io/badge/Quantization-8--bit-orange)](https://github.com/ml-explore/mlx)
+ [![Context](https://img.shields.io/badge/Context-262K_tokens-purple)](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
+ [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+
+ **[Original Model](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)** | **[MLX Framework](https://github.com/ml-explore/mlx)** | **[More Quantizations](#-other-quantization-options)**
+
+ ---
+
+ </div>
+
+ ## 📖 What is This?
+
+ This is a **high-quality 8-bit quantized version** of Kimi K2 Instruct, optimized to run on **Apple Silicon** (M1/M2/M3/M4) Macs using the MLX framework. Think of it as taking a massive trillion-parameter AI model (~1.04T total parameters, 32B active per token) and compressing it down to ~1 TB while keeping almost all of its intelligence intact!
+
+ ### ✨ Why You'll Love It
+
+ - 🚀 **Massive Context Window** - Handle up to 262,144 tokens (~200,000 words!)
+ - 🧠 **1T Parameters** - ~1.04T total with 32B active per token; one of the most capable open models available
+ - ⚡ **Apple Silicon Native** - Fully optimized for M-series chips with Metal acceleration
+ - 🎯 **8-bit Precision** - Best quality-to-size ratio for serious work
+ - 🌏 **Bilingual** - Fluent in both English and Chinese
+ - 💬 **Instruction-Tuned** - Ready for conversations, coding, analysis, and more
+
+ ## 🎯 Quick Start
+
+ ### Installation

  ```bash
  pip install mlx-lm
  ```

+ ### Your First Generation (3 lines of code!)

  ```python
  from mlx_lm import load, generate

  model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")
+ print(generate(model, tokenizer, prompt="Explain quantum entanglement simply:", max_tokens=200))
  ```

+ That's it! 🎉
+
+ ## 💻 System Requirements
+
+ | Component | Minimum | Recommended |
+ |-----------|---------|-------------|
+ | **Mac** | M1 or newer | M2 Ultra / M3 Max / M4 Max+ |
+ | **Memory** | 64 GB unified | 128 GB+ unified |
+ | **Storage** | 1.1 TB free | Fast SSD (2+ TB) |
+ | **macOS** | 12.0+ | Latest version |
+
+ > ⚠️ **Note:** This is a HUGE model! Make sure you have enough RAM and storage.
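
Before kicking off the ~1 TB download, it can be worth confirming the disk actually has room. A minimal pre-flight sketch using only the Python standard library (the 1.1 TB threshold simply mirrors the table above):

```python
# Quick pre-flight check before downloading the 8-bit weights (stdlib only).
import shutil

free_gb = shutil.disk_usage("/").free / 1e9
print(f"Free disk space: {free_gb:.0f} GB")
if free_gb < 1100:
    print("Warning: less than ~1.1 TB free - the 8-bit weights may not fit.")
```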
+
+ ## 📚 Usage Examples
+
+ ### Command Line Interface

  ```bash
  mlx_lm.generate \
  --model richardyoung/Kimi-K2-Instruct-0905-MLX-8bit \
+ --prompt "Write a Python script to analyze CSV files." \
  --max-tokens 500
  ```

+ ### Chat Conversation

  ```python
  from mlx_lm import load, generate

  model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

  conversation = """<|im_start|>system
+ You are a helpful AI assistant specialized in coding and problem-solving.<|im_end|>
  <|im_start|>user
+ Can you help me optimize this Python code?<|im_end|>
  <|im_start|>assistant
  """

+ response = generate(model, tokenizer, prompt=conversation, max_tokens=500)
  print(response)
  ```
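
Hand-writing ChatML markers works, but the bundled tokenizer can usually build the prompt for you. A minimal sketch, assuming the tokenizer shipped with this repo exposes the standard Hugging Face `apply_chat_template` API:

```python
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant specialized in coding."},
    {"role": "user", "content": "Can you help me optimize this Python code?"},
]
# Let the chat template insert the <|im_start|>/<|im_end|> markers for us.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

print(generate(model, tokenizer, prompt=prompt, max_tokens=500))
```

This avoids typos in the special tokens and keeps the prompt in sync with whatever template the repo ships.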

+ ### Advanced: Streaming Output
+
+ ```python
+ from mlx_lm import load, stream_generate
+
+ model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")
+
+ # stream_generate yields partial results as they are produced; in recent
+ # mlx-lm releases each item is a response object with a .text field.
+ for response in stream_generate(
+     model,
+     tokenizer,
+     prompt="Tell me about the future of AI:",
+     max_tokens=500,
+ ):
+     print(response.text, end="", flush=True)
+ ```

+ ## 🏗️ Architecture Highlights

+ <details>
+ <summary><b>Click to expand technical details</b></summary>

+ ### Model Specifications

+ | Feature | Value |
+ |---------|-------|
+ | **Total Parameters** | ~1.04 Trillion (32B active) |
+ | **Architecture** | DeepSeek V3 (MoE) |
+ | **Experts** | 384 routed + 1 shared |
+ | **Active Experts** | 8 per token |
+ | **Hidden Size** | 7168 |
+ | **Layers** | 61 |
+ | **Heads** | 56 |
+ | **Context Length** | 262,144 tokens |
+ | **Quantization** | 8.501 bits per weight |

+ ### Advanced Features

+ - **🎯 YaRN Rope Scaling** - 64x factor for extended context
+ - **🗜️ KV Compression** - LoRA-based (rank 512)
+ - **⚡ Query Compression** - Q-LoRA (rank 1536)
+ - **🧮 MoE Routing** - Top-8 expert selection with sigmoid scoring (see the sketch after this section)
+ - **🔧 FP8 Training** - Pre-quantized with e4m3 precision

+ </details>
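
The top-8 sigmoid routing bullet above is easier to see in code. A purely illustrative sketch with random weights (not the model's actual implementation): each token scores all 384 routed experts with a sigmoid gate, keeps the 8 highest-scoring experts, and renormalizes those 8 gate values before mixing the expert outputs.

```python
# Illustrative top-8 sigmoid routing for one token (toy example, not model code).
import numpy as np

num_experts, top_k, hidden = 384, 8, 7168
rng = np.random.default_rng(0)

x = rng.standard_normal(hidden).astype(np.float32)                    # token hidden state
w_gate = (rng.standard_normal((hidden, num_experts)) * 0.02).astype(np.float32)

scores = 1.0 / (1.0 + np.exp(-(x @ w_gate)))   # sigmoid affinity per expert
top = np.argsort(scores)[-top_k:]              # indices of the 8 best experts
gates = scores[top] / scores[top].sum()        # renormalized mixing weights

print(top)    # which experts this token is routed to
print(gates)  # how much each selected expert contributes
```

The shared expert listed in the table is applied to every token in addition to these 8 routed experts.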
 
 

+ ## 🎨 Other Quantization Options
+
+ Choose the right balance for your needs:
+
+ | Quantization | Size | Quality | Speed | Best For |
+ |--------------|------|---------|-------|----------|
+ | **8-bit** (you are here) | ~1 TB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Production, best quality |
+ | [6-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-6bit) | ~800 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Sweet spot for most users |
+ | [4-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-4bit) | ~570 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Faster inference |
+ | [2-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-2bit) | ~320 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | Experimental |
+ | [Original](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905) | ~5 TB | ⭐⭐⭐⭐⭐ | ⭐⭐ | Research only |
+
+ ## 🔧 How It Was Made
+
  This model was quantized using MLX's built-in quantization:

  ```bash
  mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path Kimi-K2-Instruct-0905-MLX-8bit \
+ -q --q-bits 8 \
+ --trust-remote-code
  ```

+ **Result:** 8.501 bits per weight (includes metadata overhead)
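
The extra ~0.5 bit over a flat 8 bits is quantization metadata. A rough back-of-the-envelope, assuming MLX's default affine quantization settings (groups of 64 weights, each group storing a 16-bit scale and a 16-bit bias):

```python
# Why "8-bit" lands at roughly 8.5 bits per weight (assumes MLX defaults:
# group_size=64 with 16-bit scales and biases stored per group).
bits = 8
group_size = 64
metadata_bits_per_group = 16 + 16             # one scale + one bias per group
overhead = metadata_bits_per_group / group_size
print(bits + overhead)                        # 8.5, close to the reported 8.501
```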
+
+ ## ⚡ Performance Tips
+
+ <details>
+ <summary><b>Getting the best performance</b></summary>
+
+ 1. **Close other applications** - Free up as much RAM as possible
+ 2. **Use an external SSD** - If your internal drive is full
+ 3. **Monitor memory** - Watch Activity Monitor during inference
+ 4. **Limit generation length** - If you hit out-of-memory errors, reduce max_tokens
+ 5. **Keep your Mac cool** - Good airflow helps maintain peak performance
+
+ </details>
+
+ ## ⚠️ Known Limitations
+
+ - 🍎 **Apple Silicon Only** - Won't work on Intel Macs or NVIDIA GPUs
+ - 💾 **Huge Storage Needs** - Make sure you have 1.1 TB+ free
+ - 🐏 **RAM Intensive** - Needs 64+ GB unified memory minimum
+ - 🐌 **Slower on M1** - Best performance on M2 Ultra or newer
+ - 🌐 **Bilingual Focus** - Optimized for English and Chinese
+
+ ## 📄 License
+
+ Apache 2.0 - Same as the original model. Free for commercial use!
+
+ ## 🙏 Acknowledgments
+
+ - **Original Model:** [Moonshot AI](https://www.moonshot.cn/) for creating Kimi K2
+ - **Framework:** Apple's [MLX team](https://github.com/ml-explore/mlx) for the amazing framework
+ - **Inspiration:** DeepSeek V3 architecture
+
+ ## 📚 Citation
+
+ If you use this model in your research or product, please cite:

  ```bibtex
  @misc{kimi-k2-2025,
  }
  ```

+ ## 🔗 Useful Links
+
+ - 📦 **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
+ - 🛠️ **MLX Framework:** [GitHub](https://github.com/ml-explore/mlx)
+ - 📖 **MLX LM Docs:** [GitHub](https://github.com/ml-explore/mlx-examples/tree/main/llms)
+ - 💬 **Discussions:** [Ask questions here!](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-8bit/discussions)
+
+ ---
+
+ <div align="center">
+
+ **Quantized with ❤️ by richardyoung**
+
+ *If you find this useful, please ⭐ star the repo and share with others!*
+
+ **Created:** October 2025 | **Format:** MLX 8-bit
+
+ </div>