MiniMax Models vLLM Deployment Guide
Introduction
We recommend deploying the MiniMax-M1 model with vLLM. In our testing, vLLM performed very well with this model and offered the following:
- Outstanding service throughput
- Efficient and intelligent memory management
- Powerful batch request processing
- Deeply optimized underlying performance
The MiniMax-M1 model can run efficiently on a single server equipped with 8 H800 or 8 H20 GPUs. A server with 8 H800 GPUs can process context inputs of up to 2 million tokens, while a server with 8 H20 GPUs can process context inputs of up to 5 million tokens.
Obtaining MiniMax Models
Obtaining the MiniMax-M1 Model
You can download the model from our official Hugging Face repositories: MiniMax-M1-40k and MiniMax-M1-80k
Download command:
pip install -U huggingface-hub
huggingface-cli download MiniMaxAI/MiniMax-M1-40k
# huggingface-cli download MiniMaxAI/MiniMax-M1-80k
# If you encounter network issues, set a mirror endpoint, then rerun the download command
export HF_ENDPOINT=https://hf-mirror.com
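The download steps above can also be driven from Python. The following is a minimal sketch, assuming `huggingface-cli` is already installed and on `PATH`; the helper names (`build_download_command`, `build_download_env`, `download_model`) are illustrative and not part of any MiniMax or Hugging Face API. It mainly shows that `HF_ENDPOINT` has to be present in the environment before the download starts.

```python
import os
import subprocess


def build_download_command(repo_id):
    """Return the huggingface-cli invocation from the guide as an argument list."""
    return ["huggingface-cli", "download", repo_id]


def build_download_env(base_env, mirror_endpoint=None):
    """Copy base_env and, if a mirror is given, set HF_ENDPOINT for the download."""
    env = dict(base_env)
    if mirror_endpoint:
        # Must be set before the CLI starts, otherwise it has no effect.
        env["HF_ENDPOINT"] = mirror_endpoint
    return env


def download_model(repo_id, mirror_endpoint=None):
    """Run the download; needs huggingface-hub installed and network access."""
    subprocess.run(
        build_download_command(repo_id),
        env=build_download_env(os.environ, mirror_endpoint),
        check=True,  # raise CalledProcessError if the CLI exits with an error
    )
```

For example, `download_model("MiniMaxAI/MiniMax-M1-40k", mirror_endpoint="https://hf-mirror.com")` would run the same command as above with the mirror already configured.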
Or download using git:
git lfs install
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-40k
git clone https://huggingface.co/MiniMaxAI/MiniMax-M1-80k
Important Note: Make sure Git LFS is installed on your system. It is required to download the model weight files completely.
Deployment Options
Option 1: Deploy Using Docker (Recommended)
To ensure consistency and stability of the deployment environment, we recommend using Docker for deployment.
Version Requirements:
- The MiniMax-M1 model requires vLLM version 0.8.3 or later for full support
- If you are using a Docker image with a vLLM version lower than the required version, you will need to:
  - Update to the latest vLLM code
  - Recompile vLLM from source. Follow the compilation instructions in Solution 2 of the Common Issues section
- Special Note: For vLLM versions between 0.8.3 and 0.9.2, you need to modify the model configuration:
  - Open config.json
  - Change config['architectures'] = ["MiniMaxM1ForCausalLM"] to config['architectures'] = ["MiniMaxText01ForCausalLM"]
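The config.json change described above can be scripted instead of edited by hand. This is a minimal sketch, assuming config.json sits in the downloaded model directory; the function name `patch_architectures` is illustrative. Note that it rewrites the file with two-space indentation, so you may want to keep a backup copy first.

```python
import json
from pathlib import Path

OLD_ARCH = "MiniMaxM1ForCausalLM"
NEW_ARCH = "MiniMaxText01ForCausalLM"


def patch_architectures(config_path):
    """Swap the architecture name in config.json; return True if the file changed."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    architectures = config.get("architectures", [])
    if OLD_ARCH not in architectures:
        return False  # already patched, or not the expected model config
    config["architectures"] = [
        NEW_ARCH if name == OLD_ARCH else name for name in architectures
    ]
    # All other keys are preserved; only the formatting of the file changes.
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True
```

Calling `patch_architectures("<model storage path>/config.json")` a second time is harmless: it returns `False` and leaves the file untouched.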
- Get the container image:
docker pull vllm/vllm-openai:v0.8.3
- Run the container:
# Set environment variables
IMAGE=vllm/vllm-openai:v0.8.3
MODEL_DIR=<model storage path>
CODE_DIR=<code path>
NAME=MiniMaxImage
# Docker run configuration
DOCKER_RUN_CMD="--network=host --privileged --ipc=host --ulimit memlock=-1 --shm-size=2gb --rm --gpus all --ulimit stack=67108864"
# Start the container
sudo docker run -it \
-v $MODEL_DIR:$MODEL_DIR \
-v $CODE_DIR:$CODE_DIR \
--name $NAME \
$DOCKER_RUN_CMD \
$IMAGE /bin/bash
Option 2: Direct Installation of vLLM
If your environment meets the following requirements:
- CUDA 12.1
- PyTorch 2.1
then you can install vLLM directly.
Installation command:
pip install vllm
If you are using a different environment configuration, please refer to the vLLM Installation Guide.
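After installing, it can help to check which of the version requirements listed earlier applies to your vLLM build. The sketch below encodes those rules under two assumptions: the range "between 0.8.3 and 0.9.2" is treated as inclusive, and only the leading `major.minor.patch` part of the version string is compared. The names `parse_version` and `check_vllm_version` are illustrative.

```python
import re

MIN_SUPPORTED = (0, 8, 3)       # first version with full MiniMax-M1 support
LAST_NEEDING_PATCH = (0, 9, 2)  # assumption: upper bound is inclusive


def parse_version(text):
    """Extract (major, minor, patch) from strings such as '0.8.3' or '0.9.2.post1'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def check_vllm_version(text):
    """Report whether the version is supported and whether config.json needs the edit."""
    version = parse_version(text)
    supported = version >= MIN_SUPPORTED
    needs_patch = supported and version <= LAST_NEEDING_PATCH
    return {"supported": supported, "needs_architecture_patch": needs_patch}
```

With vLLM installed, the version string can be read via `importlib.metadata.version("vllm")` and passed to `check_vllm_version`.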
Starting the Service
Launch the MiniMax-M1 Service
export SAFETENSORS_FAST_GPU=1
export VLLM_USE_V1=0
python3 -m vllm.entrypoints.openai.api_server \
--model <model storage path> \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 4096 \
--dtype bfloat16
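Loading a model of this size can take a while, so it is useful to wait for the server before sending requests. The following sketch polls the server until it answers; it assumes the default port 8000 and that the vLLM OpenAI-compatible server exposes a `/health` route that returns HTTP 200 once it is ready. The function name and default timings are illustrative.

```python
import http.client
import time
import urllib.request


def wait_until_ready(base_url, timeout_s=600.0, interval_s=5.0):
    """Poll <base_url>/health until it returns 200; return False on timeout."""
    url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as response:
                if response.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeouts, and HTTP errors all mean "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, `wait_until_ready("http://localhost:8000")` returns `True` as soon as the health check succeeds, or `False` after ten minutes.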
API Call Example
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MiniMaxAI/MiniMax-M1",
"messages": [
{"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
{"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
]
}'
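The same request can be sent from Python using only the standard library. This is a sketch under the assumption that the server is reachable at the base URL you pass in; the helper names are illustrative. One thing to watch: the `model` field has to match the name the server registered, which for vLLM is normally the value passed to `--model` unless `--served-model-name` is set.

```python
import json
import urllib.request


def build_chat_payload(model, user_text, system_text="You are a helpful assistant."):
    """Build the same message structure as the curl example."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": system_text}]},
            {"role": "user", "content": [{"type": "text", "text": user_text}]},
        ],
    }


def send_chat_request(base_url, payload, timeout_s=300.0):
    """POST the payload to /v1/chat/completions and return the decoded JSON reply."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `send_chat_request("http://localhost:8000", build_chat_payload("MiniMaxAI/MiniMax-M1", "Who won the world series in 2020?"))` should return an OpenAI-style response object, with the generated text typically under `choices[0]["message"]["content"]`.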
Common Issues
Module Loading Problems
If you encounter the following error:
import vllm._C # noqa
ModuleNotFoundError: No module named 'vllm._C'
Or
MiniMax-M1 model is not currently supported
We provide two solutions:
Solution 1: Copy Dependency Files
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm
cp /usr/local/lib/python3.12/dist-packages/vllm/*.so vllm
cp -r /usr/local/lib/python3.12/dist-packages/vllm/vllm_flash_attn/* vllm/vllm_flash_attn
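The paths in Solution 1 hardcode a Python 3.12 dist-packages directory, which may differ on your system. The sketch below shows one way to find the installed package directory and list the compiled `.so` files that would need copying; it is an illustration with hypothetical helper names, not part of vLLM. `importlib.util.find_spec` locates a top-level package without importing it, so this should work even when `import vllm` itself fails.

```python
import importlib.util
from pathlib import Path


def installed_package_dir(package_name="vllm"):
    """Return the directory of an installed package, or None if it is not found."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


def list_shared_objects(package_dir):
    """Return the sorted names of compiled extension files directly in package_dir."""
    return sorted(path.name for path in Path(package_dir).glob("*.so"))
```

If `installed_package_dir()` returns a path, use that path in place of `/usr/local/lib/python3.12/dist-packages/vllm` in the `cp` commands above.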
Solution 2: Install from Source
cd <working directory>
git clone https://github.com/vllm-project/vllm.git
cd vllm/
pip install -e .
Getting Support
If you encounter any issues while deploying the MiniMax-M1 model:
- Please check our official documentation
- Contact our technical support team through official channels
- Submit an Issue on our GitHub repository
We will continue to optimize the deployment experience for this model and welcome your feedback!