lukealonso committed · Commit 9abb935 · verified · 1 Parent(s): 20886f2

Update README.md

Files changed (1)
  1. README.md +131 -40
README.md CHANGED
@@ -7,50 +7,141 @@ modelopt NVFP4 quantized MiniMax-M2
 
  Instructions from another user, running on RTX Pro 6000 Blackwell:
 
- ```
- export VLLM_ATTENTION_BACKEND=FLASHINFER
- export VLLM_FLASHINFER_MOE_BACKEND=throughput
- export VLLM_USE_FLASHINFER_MOE_FP16=1
- export VLLM_USE_FLASHINFER_MOE_FP8=1
- export VLLM_USE_FLASHINFER_MOE_FP4=1
- export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
-
- # Run on 2 GPUs with tensor parallelism
- CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm/bin/vllm serve lukealonso/MiniMax-M2-NVFP4 \
- --host 0.0.0.0 \
- --port 8345 \
- --served-model-name default-model lukealonso/MiniMax-M2-NVFP4 \
- --trust-remote-code \
- --gpu-memory-utilization 0.95 \
- --pipeline-parallel-size 1 \
- --enable-expert-parallel \
- --tensor-parallel-size 2 \
- --max-model-len 196608 \
- --max-num-seqs 32 \
- --enable-auto-tool-choice \
- --reasoning-parser minimax_m2_append_think \
- --tool-call-parser minimax_m2 \
- --all2all-backend pplx \
- --enable-prefix-caching \
- --enable-chunked-prefill \
- --max-num-batched-tokens 16384 \
- --dtype auto \
- --kv-cache-dtype fp8
- ```
 
  ```
- Environment:
-
- Python: 3.12.3
- vLLM: 0.11.2.dev360+g8e7a89160
- PyTorch: 2.9.0+cu130
- CUDA: 13.0
- GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
- GPU 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
- Triton: 3.5.0
- FlashInfer: 0.5.3
  ```
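A quick sanity check for a server started with the removed command above (it binds port 8345 and registers both `default-model` and `lukealonso/MiniMax-M2-NVFP4` as served names) is to list the models. A minimal sketch, assuming you are on a vLLM build where that command still works and the server is reachable locally:

```bash
# Both aliases passed to --served-model-name should appear in the model list.
curl -s http://localhost:8345/v1/models
```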
 
  Tested (but not extensively validated) on *2x* RTX Pro 6000 Blackwell via:
  (note that these instructions no longer work due to nightly vLLM breaking NVFP4 support)
 
  Instructions from another user, running on RTX Pro 6000 Blackwell:
 
+ Running this model on vllm/vllm-openai:nightly has given mixed results: sometimes it works, sometimes it does not.
+
+ I could not run this model with the stock vllm v0.12.0 release because it was built against CUDA 12.9.
+ Other discussions on this model show it working with vLLM v0.12.0 when using CUDA 13.
+
+ I stepped through the following instructions to reliably build vLLM and run this model with vLLM 0.12.0 and CUDA 13.0.2.
+
+ # Instructions
+ ## 1. Build VLLM Image
+
+ ```bash
+ # Clone the vLLM repo
+ if [[ ! -d vllm ]]; then
+     git clone https://github.com/vllm-project/vllm.git
+ fi
+
+ # Check out the v0.12.0 release of vLLM
+ cd vllm
+ git checkout releases/v0.12.0
+
+ # Build with CUDA 13.0.2, the precompiled vllm cu130 wheel, and an Ubuntu 22.04 base image
+ DOCKER_BUILDKIT=1 \
+ docker build . \
+     --target vllm-openai \
+     --tag vllm/vllm-openai:custom-vllm-0.12.0-cuda-13.0.2-py-3.12 \
+     --file docker/Dockerfile \
+     --build-arg max_jobs=64 \
+     --build-arg nvcc_threads=16 \
+     --build-arg CUDA_VERSION=13.0.2 \
+     --build-arg PYTHON_VERSION=3.12 \
+     --build-arg VLLM_USE_PRECOMPILED=true \
+     --build-arg VLLM_MAIN_CUDA_VERSION=130 \
+     --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.2-devel-ubuntu22.04 \
+     --build-arg RUN_WHEEL_CHECK=false \
+     ;
+ # Ubuntu 22.04 is required because there is no 20.04 variant of the CUDA 13.0.2 base image
  ```
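Before moving on, it can be worth confirming that the image actually contains the expected vLLM and CUDA combination. A minimal sketch, assuming the stock `vllm-openai` image layout (the `--entrypoint` override to plain `python3` is the only thing added here):

```bash
# Print the vLLM version and the CUDA version PyTorch was built against.
docker run --rm --entrypoint python3 \
  vllm/vllm-openai:custom-vllm-0.12.0-cuda-13.0.2-py-3.12 \
  -c "import vllm, torch; print(vllm.__version__, torch.version.cuda)"
```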
+
+ ## 2. Run the custom VLLM Image
+
+ ```yaml
+ services:
+   vllm:
+
+     # !!! Notice !!! this is the custom image built in step 1 above
+     image: vllm/vllm-openai:custom-vllm-0.12.0-cuda-13.0.2-py-3.12
+
+     environment:
+       # Optional
+       VLLM_NO_USAGE_STATS: "1"
+       DO_NOT_TRACK: "1"
+       CUDA_DEVICE_ORDER: PCI_BUS_ID
+       VLLM_LOGGING_LEVEL: INFO
+
+       # Required (I think)
+       VLLM_ATTENTION_BACKEND: FLASHINFER
+       VLLM_FLASHINFER_MOE_BACKEND: throughput
+       VLLM_USE_FLASHINFER_MOE_FP16: 1
+       VLLM_USE_FLASHINFER_MOE_FP8: 1
+       VLLM_USE_FLASHINFER_MOE_FP4: 1
+       VLLM_USE_FLASHINFER_MOE_MXFP4_BF16: 1
+       VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS: 1
+       VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: 1
+
+       # Required (I think)
+       VLLM_WORKER_MULTIPROC_METHOD: spawn
+       NCCL_P2P_DISABLE: 1
+
+     entrypoint: /bin/bash
+     command:
+       - -c
+       - |
+         vllm serve \
+           /root/.cache/huggingface/hub/models--lukealonso--MiniMax-M2-NVFP4/snapshots/d8993b15556ab7294530f1ba50a93ad130166174/ \
+           --served-model-name minimax-m2-fp4 \
+           --gpu-memory-utilization 0.95 \
+           --pipeline-parallel-size 1 \
+           --enable-expert-parallel \
+           --tensor-parallel-size 4 \
+           --max-model-len $(( 192 * 1024 )) \
+           --max-num-seqs 32 \
+           --enable-auto-tool-choice \
+           --reasoning-parser minimax_m2_append_think \
+           --tool-call-parser minimax_m2 \
+           --all2all-backend pplx \
+           --enable-prefix-caching \
+           --enable-chunked-prefill \
+           --max-num-batched-tokens $(( 64 * 1024 )) \
+           --dtype auto \
+           --kv-cache-dtype fp8 \
+           ;
+
  ```
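With the compose file saved (e.g. as `docker-compose.yml`), bringing the service up and watching it load looks roughly like the sketch below. Note that the snippet above does not show a `ports:` mapping or a GPU reservation for the container, so you will likely need to add those for your setup:

```bash
# Start the vllm service in the background, then follow its logs until the
# model finishes loading and the OpenAI-compatible server reports it is ready.
docker compose up -d vllm
docker compose logs -f vllm
```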
 
+ # VLLM CLI Arguments Explained
+
+ ## Required Model Arguments
+
+ 1. The path to the model (alternatively the Hugging Face ID `lukealonso/MiniMax-M2-NVFP4`)
+    - `/root/.cache/huggingface/hub/models--lukealonso--MiniMax-M2-NVFP4/snapshots/d8993b15556ab7294530f1ba50a93ad130166174/`
+ 2. The minimax-m2 reasoning/tool-call parsers and tool config (see the request sketch below)
+    - `--reasoning-parser minimax_m2_append_think`
+    - `--tool-call-parser minimax_m2`
+    - `--enable-auto-tool-choice`
+
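The tool-calling flags above are what allow OpenAI-style `tools` requests to work against this server. A minimal request sketch, assuming the server is published on localhost:8000 (not shown in the compose file) and using a made-up `get_weather` tool purely for illustration:

```bash
# Hypothetical tool-calling request; the model should answer with a tool call.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m2-fp4",
    "messages": [{"role": "user", "content": "What is the weather in Paris right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```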
+ ## Required Compute Arguments
+ 1. The parallelism mode (multi-GPU, 4x in this example; see the sketch after this list)
+    - `--enable-expert-parallel`
+    - `--pipeline-parallel-size 1`
+    - `--tensor-parallel-size 4`
+    - `--all2all-backend pplx`
+ 2. The KV cache & layer data types
+    - `--kv-cache-dtype fp8`
+    - `--dtype auto`
+
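Since this card mentions both 2x and 4x GPU setups, one convenience (not part of the original instructions) is deriving the tensor-parallel size from however many GPUs are visible instead of hard-coding it. A rough sketch reusing the parallelism flags above:

```bash
# Hypothetical helper: match --tensor-parallel-size to the visible GPU count.
TP_SIZE=$(nvidia-smi --list-gpus | wc -l)

vllm serve lukealonso/MiniMax-M2-NVFP4 \
  --tensor-parallel-size "$TP_SIZE" \
  --pipeline-parallel-size 1 \
  --enable-expert-parallel \
  --all2all-backend pplx \
  --dtype auto \
  --kv-cache-dtype fp8
```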
+ ## Optional Model Arguments
+ 1. The name the model is presented under to API clients
+    - `--served-model-name minimax-m2-fp4`
+ 2. The context size available to the model (192k max; see the expansion below)
+    - `--max-model-len $(( 192 * 1024 ))`
+ 3. The prompt chunking size (faster time-to-first-token with large prompts)
+    - `--enable-chunked-prefill`
+    - `--max-num-batched-tokens $(( 64 * 1024 ))`
+
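For reference, the shell arithmetic in these flags just expands to plain token counts; the removed command at the top of this card passed the same context length literally:

```bash
echo $(( 192 * 1024 ))   # 196608 -> --max-model-len
echo $(( 64 * 1024 ))    # 65536  -> --max-num-batched-tokens
```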
+ ## Optional Performance Arguments
+ 1. How much GPU memory vLLM may claim per device
+    - `--gpu-memory-utilization 0.95`
+ 2. How many requests the server will process concurrently
+    - `--max-num-seqs 32`
+ 3. Allow KV cache sharing for overlapping prompts (the metrics sketch below is one way to watch this)
+    - `--enable-prefix-caching`
+
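One way to sanity-check these settings at runtime is vLLM's Prometheus endpoint, served on the same port as the API. A rough sketch; metric names vary somewhat between vLLM versions, and localhost:8000 is again an assumption:

```bash
# Peek at scheduler and prefix-cache metrics exposed by the running server.
curl -s http://localhost:8000/metrics | grep -E 'num_requests|prefix_cache|cache_usage'
```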
 
  Tested (but not extensively validated) on *2x* RTX Pro 6000 Blackwell via:
  (note that these instructions no longer work due to nightly vLLM breaking NVFP4 support)