richardyoung committed
Commit 68f9b5c · verified · 1 Parent(s): 71034ca

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +151 -98

README.md CHANGED
@@ -9,82 +9,88 @@ tags:
  - moe
  - instruction-following
  - 8-bit
  model_type: kimi_k2
  pipeline_tag: text-generation
  ---

- # Kimi-K2-Instruct-0905 MLX 8-bit
-
- MLX 8-bit quantized version of [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905), a state-of-the-art instruction-following language model based on DeepSeek V3 architecture.
-
- ## Model Details
-
- **Architecture:** DeepSeek V3 (Kimi K2)
- - **Parameters:** ~671B total (Mixture of Experts)
- - 384 routed experts
- - 8 experts per token
- - 1 shared expert
- - **Hidden Size:** 7168
- - **Layers:** 61
- - **Context Length:** 262,144 tokens
- - **Quantization:** MLX 8-bit (8.501 bits per weight)
- - **Size:** 1.0 TB
- - **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
-
- ## Features
-
- - Long context support (262K tokens)
- - Advanced Mixture of Experts (MoE) architecture with 384 experts
- - Optimized for Apple Silicon with MLX framework
- - High-quality 8-bit quantization maintains excellent performance
- - Instruction-following and multi-turn conversation capabilities
- - Native Metal acceleration on M1/M2/M3/M4 Macs
-
- ## Installation

  ```bash
  pip install mlx-lm
  ```

- ## Usage
-
- ### Python API

  ```python
  from mlx_lm import load, generate

- # Load the model
  model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")
-
- # Generate text
- prompt = "Explain quantum computing in simple terms."
- response = generate(model, tokenizer, prompt=prompt, max_tokens=500)
- print(response)
  ```

- ### Command Line

  ```bash
  mlx_lm.generate \
  --model richardyoung/Kimi-K2-Instruct-0905-MLX-8bit \
- --prompt "Write a Python function to calculate Fibonacci numbers." \
  --max-tokens 500
  ```

- ### Chat Format
-
- The model uses the ChatML format:
-
- ```
- <|im_start|>system
- You are a helpful assistant.<|im_end|>
- <|im_start|>user
- {user message}<|im_end|>
- <|im_start|>assistant
- {assistant response}<|im_end|>
- ```
-
- ### Multi-turn Conversation Example

  ```python
  from mlx_lm import load, generate
@@ -92,55 +98,75 @@ from mlx_lm import load, generate
  model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

  conversation = """<|im_start|>system
- You are a helpful coding assistant.<|im_end|>
  <|im_start|>user
- Write a Python function to reverse a string.<|im_end|>
  <|im_start|>assistant
  """

- response = generate(model, tokenizer, prompt=conversation, max_tokens=300)
  print(response)
  ```

- ## System Requirements
-
- **Minimum:**
- - 1.1 TB free disk space
- - 64 GB RAM (unified memory)
- - Apple Silicon Mac (M1 or later)
- - macOS 12.0 or later
-
- **Recommended:**
- - 128 GB+ unified memory
- - M2 Ultra, M3 Max, or M4 Max/Ultra
- - Fast SSD storage
-
- ## Performance Notes
-
- - **Memory Usage:** ~1 TB model size + ~20-40 GB runtime overhead
- - **Inference Speed:** Depends on hardware (faster on M2 Ultra/M3 Max)
- - **Quantization:** 8-bit quantization maintains near-original model quality
- - **MoE Efficiency:** Only 8 experts activated per token (not all 384)
-
- ## Model Variants
-
- If you need different quantization levels or formats:
-
- - **MLX 6-bit** (coming soon): `richardyoung/Kimi-K2-Instruct-0905-MLX-6bit`
- - **MLX 4-bit** (coming soon): `richardyoung/Kimi-K2-Instruct-0905-MLX-4bit`
- - **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
-
- ## Limitations
-
- - Requires Apple Silicon (not compatible with x86/CUDA)
- - Very large model size (1 TB) requires significant storage
- - High memory requirements (64+ GB unified memory)
- - Inference speed depends heavily on available RAM and SSD speed
- - Chinese-English bilingual model, optimized for both languages
-
- ## Technical Details
-
- ### Quantization Method

  This model was quantized using MLX's built-in quantization:

@@ -148,22 +174,46 @@ This model was quantized using MLX's built-in quantization:
  mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path Kimi-K2-Instruct-0905-MLX-8bit \
- -q --q-bits 8 --trust-remote-code
  ```

- **Result:** 8.501 bits per weight (slightly higher than 8-bit due to metadata)
-
- ### Architecture Highlights
-
- - **Rope Scaling:** YaRN with 64x factor for extended context
- - **KV Compression:** LoRA-based key-value compression (rank 512)
- - **Query Compression:** Q-LoRA rank 1536
- - **MoE Routing:** Top-8 expert selection with sigmoid scoring
- - **Training:** Pre-quantized with FP8 (e4m3) in base model
-
- ## Citation
-
- If you use this model, please cite the original Kimi K2 work:

  ```bibtex
  @misc{kimi-k2-2025,
@@ -174,18 +224,21 @@
  }
  ```

- ## License
-
- Same as base model: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
-
- ## Links
-
- - **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
- - **MLX Framework:** [GitHub](https://github.com/ml-explore/mlx)
- - **MLX LM:** [GitHub](https://github.com/ml-explore/mlx-examples/tree/main/llms)
-
- ---
-
- **Quantized by:** richardyoung
- **Format:** MLX 8-bit
- **Created:** 2025-10-25
 
  - moe
  - instruction-following
  - 8-bit
+ - apple-silicon
  model_type: kimi_k2
  pipeline_tag: text-generation
+ language:
+ - en
+ - zh
+ library_name: mlx
  ---

+ <div align="center">
+
+ # 🌙 Kimi K2 Instruct - MLX 8-bit
+
+ ### State-of-the-Art 1-Trillion-Parameter MoE Model, Optimized for Apple Silicon
+
+ [![MLX](https://img.shields.io/badge/MLX-Optimized-blue?logo=apple)](https://github.com/ml-explore/mlx)
+ [![Model Size](https://img.shields.io/badge/Size-1.0_TB-green)](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-8bit)
+ [![Quantization](https://img.shields.io/badge/Quantization-8--bit-orange)](https://github.com/ml-explore/mlx)
+ [![Context](https://img.shields.io/badge/Context-262K_tokens-purple)](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
+ [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+
+ **[Original Model](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)** | **[MLX Framework](https://github.com/ml-explore/mlx)** | **[More Quantizations](#-other-quantization-options)**
+
+ ---
+
+ </div>
+
+ ## 📖 What is This?
+
+ This is a **high-quality 8-bit quantized version** of Kimi K2 Instruct, optimized to run on **Apple Silicon** (M1/M2/M3/M4) Macs using the MLX framework. Think of it as taking a massive trillion-parameter AI model (~1.04T total parameters, 32B active per token) and compressing it down to ~1 TB while keeping almost all of its intelligence intact!
+
+ ### ✨ Why You'll Love It
+
+ - 🚀 **Massive Context Window** - Handle up to 262,144 tokens (~200,000 words!)
+ - 🧠 **1T Parameters** - ~1.04T total with 32B active per token; one of the most capable open models available
+ - ⚡ **Apple Silicon Native** - Fully optimized for M-series chips with Metal acceleration
+ - 🎯 **8-bit Precision** - Best quality-to-size ratio for serious work
+ - 🌏 **Bilingual** - Fluent in both English and Chinese
+ - 💬 **Instruction-Tuned** - Ready for conversations, coding, analysis, and more
+
+ ## 🎯 Quick Start
+
+ ### Installation

  ```bash
  pip install mlx-lm
  ```

+ ### Your First Generation (3 lines of code!)

  ```python
  from mlx_lm import load, generate

  model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")
+ print(generate(model, tokenizer, prompt="Explain quantum entanglement simply:", max_tokens=200))
  ```

+ That's it! 🎉
+
+ ## 💻 System Requirements
+
+ | Component | Minimum | Recommended |
+ |-----------|---------|-------------|
+ | **Mac** | M1 or newer | M2 Ultra / M3 Max / M4 Max+ |
+ | **Memory** | 64 GB unified | 128 GB+ unified |
+ | **Storage** | 1.1 TB free | Fast SSD (2+ TB) |
+ | **macOS** | 12.0+ | Latest version |
+
+ > ⚠️ **Note:** This is a HUGE model! Make sure you have enough RAM and storage.
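
Before kicking off the ~1 TB download, it can be worth confirming the disk actually has room. A minimal pre-flight sketch using only the Python standard library (the 1.1 TB threshold simply mirrors the table above):

```python
# Quick pre-flight check before downloading the 8-bit weights (stdlib only).
import shutil

free_gb = shutil.disk_usage("/").free / 1e9
print(f"Free disk space: {free_gb:.0f} GB")
if free_gb < 1100:
    print("Warning: less than ~1.1 TB free - the 8-bit weights may not fit.")
```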
+
+ ## 📚 Usage Examples
+
+ ### Command Line Interface

  ```bash
  mlx_lm.generate \
  --model richardyoung/Kimi-K2-Instruct-0905-MLX-8bit \
+ --prompt "Write a Python script to analyze CSV files." \
  --max-tokens 500
  ```

+ ### Chat Conversation

  ```python
  from mlx_lm import load, generate

  model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

  conversation = """<|im_start|>system
+ You are a helpful AI assistant specialized in coding and problem-solving.<|im_end|>
  <|im_start|>user
+ Can you help me optimize this Python code?<|im_end|>
  <|im_start|>assistant
  """

+ response = generate(model, tokenizer, prompt=conversation, max_tokens=500)
  print(response)
  ```
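
Hand-writing ChatML markers works, but the bundled tokenizer can usually build the prompt for you. A minimal sketch, assuming the tokenizer shipped with this repo exposes the standard Hugging Face `apply_chat_template` API:

```python
from mlx_lm import load, generate

model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant specialized in coding."},
    {"role": "user", "content": "Can you help me optimize this Python code?"},
]
# Let the chat template insert the <|im_start|>/<|im_end|> markers for us.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

print(generate(model, tokenizer, prompt=prompt, max_tokens=500))
```

This avoids typos in the special tokens and keeps the prompt in sync with whatever template the repo ships.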

+ ### Advanced: Streaming Output
+
+ ```python
+ from mlx_lm import load, stream_generate
+
+ model, tokenizer = load("richardyoung/Kimi-K2-Instruct-0905-MLX-8bit")
+
+ # stream_generate yields partial results as they are produced; in recent
+ # mlx-lm releases each item is a response object with a .text field.
+ for response in stream_generate(
+     model,
+     tokenizer,
+     prompt="Tell me about the future of AI:",
+     max_tokens=500,
+ ):
+     print(response.text, end="", flush=True)
+ ```

+ ## 🏗️ Architecture Highlights

+ <details>
+ <summary><b>Click to expand technical details</b></summary>

+ ### Model Specifications

+ | Feature | Value |
+ |---------|-------|
+ | **Total Parameters** | ~1.04 Trillion (32B active) |
+ | **Architecture** | DeepSeek V3 (MoE) |
+ | **Experts** | 384 routed + 1 shared |
+ | **Active Experts** | 8 per token |
+ | **Hidden Size** | 7168 |
+ | **Layers** | 61 |
+ | **Heads** | 56 |
+ | **Context Length** | 262,144 tokens |
+ | **Quantization** | 8.501 bits per weight |

+ ### Advanced Features

+ - **🎯 YaRN Rope Scaling** - 64x factor for extended context
+ - **🗜️ KV Compression** - LoRA-based (rank 512)
+ - **⚡ Query Compression** - Q-LoRA (rank 1536)
+ - **🧮 MoE Routing** - Top-8 expert selection with sigmoid scoring (see the sketch after this section)
+ - **🔧 FP8 Training** - Pre-quantized with e4m3 precision

+ </details>
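
The top-8 sigmoid routing bullet above is easier to see in code. A purely illustrative sketch with random weights (not the model's actual implementation): each token scores all 384 routed experts with a sigmoid gate, keeps the 8 highest-scoring experts, and renormalizes those 8 gate values before mixing the expert outputs.

```python
# Illustrative top-8 sigmoid routing for one token (toy example, not model code).
import numpy as np

num_experts, top_k, hidden = 384, 8, 7168
rng = np.random.default_rng(0)

x = rng.standard_normal(hidden).astype(np.float32)                    # token hidden state
w_gate = (rng.standard_normal((hidden, num_experts)) * 0.02).astype(np.float32)

scores = 1.0 / (1.0 + np.exp(-(x @ w_gate)))   # sigmoid affinity per expert
top = np.argsort(scores)[-top_k:]              # indices of the 8 best experts
gates = scores[top] / scores[top].sum()        # renormalized mixing weights

print(top)    # which experts this token is routed to
print(gates)  # how much each selected expert contributes
```

The shared expert listed in the table is applied to every token in addition to these 8 routed experts.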
 
 

+ ## 🎨 Other Quantization Options
+
+ Choose the right balance for your needs:
+
+ | Quantization | Size | Quality | Speed | Best For |
+ |--------------|------|---------|-------|----------|
+ | **8-bit** (you are here) | ~1 TB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Production, best quality |
+ | [6-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-6bit) | ~800 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Sweet spot for most users |
+ | [4-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-4bit) | ~570 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Faster inference |
+ | [2-bit](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-2bit) | ~320 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | Experimental |
+ | [Original](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905) | ~5 TB | ⭐⭐⭐⭐⭐ | ⭐⭐ | Research only |
+
+ ## 🔧 How It Was Made
+
  This model was quantized using MLX's built-in quantization:

  ```bash
  mlx_lm.convert \
  --hf-path moonshotai/Kimi-K2-Instruct-0905 \
  --mlx-path Kimi-K2-Instruct-0905-MLX-8bit \
+ -q --q-bits 8 \
+ --trust-remote-code
  ```

+ **Result:** 8.501 bits per weight (includes metadata overhead)
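
The extra ~0.5 bit over a flat 8 bits is quantization metadata. A rough back-of-the-envelope, assuming MLX's default affine quantization settings (groups of 64 weights, each group storing a 16-bit scale and a 16-bit bias):

```python
# Why "8-bit" lands at roughly 8.5 bits per weight (assumes MLX defaults:
# group_size=64 with 16-bit scales and biases stored per group).
bits = 8
group_size = 64
metadata_bits_per_group = 16 + 16             # one scale + one bias per group
overhead = metadata_bits_per_group / group_size
print(bits + overhead)                        # 8.5, close to the reported 8.501
```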
+
+ ## ⚡ Performance Tips
+
+ <details>
+ <summary><b>Getting the best performance</b></summary>
+
+ 1. **Close other applications** - Free up as much RAM as possible
+ 2. **Use an external SSD** - If your internal drive is full
+ 3. **Monitor memory** - Watch Activity Monitor during inference
+ 4. **Limit generation length** - If you hit out-of-memory errors, reduce max_tokens
+ 5. **Keep your Mac cool** - Good airflow helps maintain peak performance
+
+ </details>
+
+ ## ⚠️ Known Limitations
+
+ - 🍎 **Apple Silicon Only** - Won't work on Intel Macs or NVIDIA GPUs
+ - 💾 **Huge Storage Needs** - Make sure you have 1.1 TB+ free
+ - 🐏 **RAM Intensive** - Needs 64+ GB unified memory minimum
+ - 🐌 **Slower on M1** - Best performance on M2 Ultra or newer
+ - 🌐 **Bilingual Focus** - Optimized for English and Chinese
+
+ ## 📄 License
+
+ Apache 2.0 - Same as the original model. Free for commercial use!
+
+ ## 🙏 Acknowledgments
+
+ - **Original Model:** [Moonshot AI](https://www.moonshot.cn/) for creating Kimi K2
+ - **Framework:** Apple's [MLX team](https://github.com/ml-explore/mlx) for the amazing framework
+ - **Inspiration:** DeepSeek V3 architecture
+
+ ## 📚 Citation
+
+ If you use this model in your research or product, please cite:

  ```bibtex
  @misc{kimi-k2-2025,
  }
  ```

+ ## 🔗 Useful Links
+
+ - 📦 **Original Model:** [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905)
+ - 🛠️ **MLX Framework:** [GitHub](https://github.com/ml-explore/mlx)
+ - 📖 **MLX LM Docs:** [GitHub](https://github.com/ml-explore/mlx-examples/tree/main/llms)
+ - 💬 **Discussions:** [Ask questions here!](https://huggingface.co/richardyoung/Kimi-K2-Instruct-0905-MLX-8bit/discussions)
+
+ ---
+
+ <div align="center">
+
+ **Quantized with ❤️ by richardyoung**
+
+ *If you find this useful, please ⭐ star the repo and share with others!*
+
+ **Created:** October 2025 | **Format:** MLX 8-bit
+
+ </div>