Instructions for using MiniMaxAI/MiniMax-M1-40k-hf with libraries, inference providers, notebooks, and local apps.

Libraries

How to use MiniMaxAI/MiniMax-M1-40k-hf with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="MiniMaxAI/MiniMax-M1-40k-hf")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-M1-40k-hf")
model = AutoModelForCausalLM.from_pretrained("MiniMaxAI/MiniMax-M1-40k-hf")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use MiniMaxAI/MiniMax-M1-40k-hf with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "MiniMaxAI/MiniMax-M1-40k-hf"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MiniMaxAI/MiniMax-M1-40k-hf",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/MiniMaxAI/MiniMax-M1-40k-hf

SGLang

How to use MiniMaxAI/MiniMax-M1-40k-hf with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "MiniMaxAI/MiniMax-M1-40k-hf" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MiniMaxAI/MiniMax-M1-40k-hf",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "MiniMaxAI/MiniMax-M1-40k-hf" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MiniMaxAI/MiniMax-M1-40k-hf",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use MiniMaxAI/MiniMax-M1-40k-hf with Docker Model Runner:
```
docker model run hf.co/MiniMaxAI/MiniMax-M1-40k-hf
```


🚀 MiniMax Models vLLM Deployment Guide

vLLM Deployment Guide (Chinese version)

📖 Introduction

We recommend using vLLM to deploy the MiniMax-M1 model. In our testing, vLLM performed very well with this model and offers the following features:

🔥 Outstanding service throughput performance
⚡ Efficient and intelligent memory management
📦 Powerful batch request processing capability
⚙️ Deeply optimized underlying performance

The MiniMax-M1 model runs efficiently on a single server equipped with 8 H800 or 8 H20 GPUs. A server with 8 H800 GPUs can process context inputs of up to 2 million tokens, while a server with 8 H20 GPUs supports context lengths of up to 5 million tokens.

💾 Obtaining MiniMax Models

Obtaining the MiniMax-M1 Model

You can download the model from our official Hugging Face repositories: MiniMax-M1-40k and MiniMax-M1-80k.

Download command:

pip install -U huggingface-hub
huggingface-cli download MiniMaxAI/MiniMax-M1-40k
# huggingface-cli download MiniMaxAI/MiniMax-M1-80k

# If you encounter network issues, set a mirror endpoint and rerun the download
export HF_ENDPOINT=https://hf-mirror.com

Or download using git:

git lfs install
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-40k
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-80k

⚠️ Important Note: Make sure Git LFS is installed on your system. It is required to download the complete model weight files.
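The same download can also be scripted from Python. The sketch below assumes the `huggingface_hub` package installed by the pip command above and uses its `snapshot_download` function. The helper names `repo_id_for` and `download_model` and the `use_mirror` flag are illustrative and not part of any MiniMax tooling.

```python
import os

# Repository names taken from the download commands in this guide.
REPOS = {
    "40k": "MiniMaxAI/MiniMax-M1-40k",
    "80k": "MiniMaxAI/MiniMax-M1-80k",
}


def repo_id_for(variant: str) -> str:
    """Map a variant label ("40k" or "80k") to its Hugging Face repo ID."""
    try:
        return REPOS[variant]
    except KeyError:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(REPOS)}"
        ) from None


def download_model(variant: str, local_dir: str, use_mirror: bool = False) -> str:
    """Download every file of the chosen variant and return the local path."""
    if use_mirror:
        # huggingface_hub reads HF_ENDPOINT when it is first imported,
        # so the variable has to be set before the import below.
        os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
    from huggingface_hub import snapshot_download  # lazy import on purpose

    return snapshot_download(repo_id=repo_id_for(variant), local_dir=local_dir)
```

Calling `download_model("40k", "/data/models/MiniMax-M1-40k")` (the target directory is a placeholder) would fetch the 40k variant. The mirror flag only has an effect if `huggingface_hub` has not already been imported in the same process.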

🛠️ Deployment Options

Option 1: Deploy Using Docker (Recommended)

To ensure consistency and stability of the deployment environment, we recommend using Docker for deployment.

⚠️ Version Requirements:

The MiniMax-M1 model requires vLLM version 0.8.3 or later for full support.
If your Docker image ships a vLLM version lower than 0.8.3, you will need to:
1. Update to the latest vLLM code
2. Recompile vLLM from source, following the instructions in Solution 2 of the Common Issues section
Special Note: For vLLM versions between 0.8.3 and 0.9.2, you also need to modify the model configuration:
1. Open config.json in the model directory
2. Change "architectures": ["MiniMaxM1ForCausalLM"] to "architectures": ["MiniMaxText01ForCausalLM"]
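The config.json change can be applied with a few lines of Python instead of by hand. This is a minimal sketch: the function names `needs_patch` and `patch_architectures` are made up for illustration, and treating both version bounds (0.8.3 and 0.9.2) as inclusive is an assumption, since the note above does not say whether the endpoints are included.

```python
import json
import re
from pathlib import Path

OLD_ARCH = "MiniMaxM1ForCausalLM"
NEW_ARCH = "MiniMaxText01ForCausalLM"


def needs_patch(vllm_version: str) -> bool:
    """Return True for versions in the range named in the Special Note.

    Assumption: both bounds (0.8.3 and 0.9.2) are treated as inclusive.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", vllm_version)
    if match is None:
        raise ValueError(f"cannot parse version {vllm_version!r}")
    version = tuple(int(part) for part in match.groups())
    return (0, 8, 3) <= version <= (0, 9, 2)


def patch_architectures(config_path: str) -> bool:
    """Rewrite the architectures entry in config.json; return True if changed."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if config.get("architectures") != [OLD_ARCH]:
        # Already patched, or an unexpected value: leave the file untouched.
        return False
    config["architectures"] = [NEW_ARCH]
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return True
```

A typical call would be `patch_architectures("<model storage path>/config.json")`, run once after downloading the weights. The function is idempotent, so running it a second time leaves the file as it is.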

Get the container image:

docker pull vllm/vllm-openai:v0.8.3

Run the container:

# Set environment variables
IMAGE=vllm/vllm-openai:v0.8.3
MODEL_DIR=<model storage path>
CODE_DIR=<code path>
NAME=MiniMaxImage

# Docker run options (expanded unquoted below so the shell splits them into separate flags)
DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"

# Start the container, mounting the model and code directories at the same paths
sudo docker run -it \
    -v "$MODEL_DIR:$MODEL_DIR" \
    -v "$CODE_DIR:$CODE_DIR" \
    --name "$NAME" \
    $DOCKER_RUN_CMD \
    "$IMAGE" /bin/bash

Option 2: Direct Installation of vLLM

If your environment meets the following requirements, you can install vLLM directly:

CUDA 12.1
PyTorch 2.1

Installation command:

pip install vllm

💡 If you are using other environment configurations, please refer to the vLLM Installation Guide

🚀 Starting the Service

Launch MiniMax-M1 Service

export SAFETENSORS_FAST_GPU=1
export VLLM_USE_V1=0
python3 -m vllm.entrypoints.openai.api_server \
    --model <model storage path> \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --quantization experts_int8 \
    --max-model-len 4096 \
    --dtype bfloat16
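If the server is started from a Python script rather than a shell, the same launch can be expressed as an argument list. The sketch below only assembles the environment and command shown above. `build_launch` is a hypothetical helper name, and the context-length flag is written with dashes (`--max-model-len`), which is how vLLM documents it.

```python
import os
import shlex


def build_launch(model_dir: str, tensor_parallel: int = 8, max_model_len: int = 4096):
    """Return (env, argv) mirroring the launch command in this guide."""
    # Copy the current environment and add the two variables the guide exports.
    env = dict(os.environ, SAFETENSORS_FAST_GPU="1", VLLM_USE_V1="0")
    argv = [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_dir,
        "--tensor-parallel-size", str(tensor_parallel),
        "--trust-remote-code",
        "--quantization", "experts_int8",
        "--max-model-len", str(max_model_len),
        "--dtype", "bfloat16",
    ]
    return env, argv


if __name__ == "__main__":
    # "/path/to/MiniMax-M1-40k" is a placeholder for the model storage path.
    env, argv = build_launch("/path/to/MiniMax-M1-40k")
    print(shlex.join(argv))
    # To actually start the server:
    #   import subprocess; subprocess.run(argv, env=env, check=True)
```

Keeping the command as a list avoids shell quoting problems when the model path contains spaces, and makes it easy to raise `max_model_len` beyond the 4096 used in the example.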

API Call Example

The "model" field must match the value passed to --model when the server was started (or the value of --served-model-name, if that option is set).

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "<model storage path>",
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
            {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
        ]
    }'
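The same request can be sent from Python using only the standard library. This is a sketch under the assumption that the server listens on `http://localhost:8000` as in the curl example. `build_chat_request` and `chat` are illustrative names, and the 600-second timeout is an arbitrary choice.

```python
import json
import urllib.request


def build_chat_request(model: str, user_text: str,
                       system_text: str = "You are a helpful assistant.") -> dict:
    """Build a chat-completions payload in the same shape as the curl example."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": system_text}]},
            {"role": "user", "content": [{"type": "text", "text": user_text}]},
        ],
    }


def chat(base_url: str, payload: dict, timeout: float = 600.0) -> str:
    """POST the payload to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible responses carry the reply in choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000", build_chat_request("<model storage path>", "Who won the world series in 2020?"))` returns the reply as a string. The model name passed here has to be the one the server was started with.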

❗ Common Issues

Module Loading Problems

If you encounter either of the following errors:

import vllm._C  # noqa
ModuleNotFoundError: No module named 'vllm._C'

or

MiniMax-M1 model is not currently supported

We provide two solutions:

Solution 1: Copy Dependency Files

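cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm
# Copy the compiled extensions from the installed package
# (adjust the Python version in the path to match your environment)
cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn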

Solution 2: Install from Source

cd <working directory>
git clone https://github.com/vllm-project/vllm.git

cd vllm/
pip install -e .
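After applying either solution, one way to confirm that the compiled extension is now visible is to ask Python's import system whether it can locate `vllm._C`. The sketch below uses `importlib.util.find_spec` from the standard library. `has_module` is an illustrative helper name, and the check only covers the module-loading error, not the "model is not currently supported" message.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the import system can locate the module `name`."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # For dotted names, find_spec imports the parent package first;
        # a missing parent (e.g. vllm itself) raises instead of returning None.
        return False


if __name__ == "__main__":
    for module in ("vllm", "vllm._C"):
        status = "found" if has_module(module) else "missing"
        print(f"{module}: {status}")
```

Run the script from the directory you will launch the server from, since a local `vllm` source checkout in the working directory takes precedence over the installed package on the import path.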

📮 Getting Support

If you encounter any issues while deploying the MiniMax-M1 model:

Please check our official documentation
Contact our technical support team through official channels
Submit an Issue on our GitHub repository

We will continue to improve the deployment experience for this model and welcome your feedback!