lukealonso committed · Commit 9abb935 · verified · 1 Parent(s): 20886f2

Update README.md

Files changed (1)
  1. README.md +131 -40
README.md CHANGED
@@ -7,50 +7,141 @@ modelopt NVFP4 quantized MiniMax-M2
 
  Instructions from another user, running on RTX Pro 6000 Blackwell:
 
- ```
- export VLLM_ATTENTION_BACKEND=FLASHINFER
- export VLLM_FLASHINFER_MOE_BACKEND=throughput
- export VLLM_USE_FLASHINFER_MOE_FP16=1
- export VLLM_USE_FLASHINFER_MOE_FP8=1
- export VLLM_USE_FLASHINFER_MOE_FP4=1
- export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
-
- # Run on 2 GPUs with tensor parallelism
- CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1} /opt/vllm/bin/vllm serve lukealonso/MiniMax-M2-NVFP4 \
- --host 0.0.0.0 \
- --port 8345 \
- --served-model-name default-model lukealonso/MiniMax-M2-NVFP4 \
- --trust-remote-code \
- --gpu-memory-utilization 0.95 \
- --pipeline-parallel-size 1 \
- --enable-expert-parallel \
- --tensor-parallel-size 2 \
- --max-model-len 196608 \
- --max-num-seqs 32 \
- --enable-auto-tool-choice \
- --reasoning-parser minimax_m2_append_think \
- --tool-call-parser minimax_m2 \
- --all2all-backend pplx \
- --enable-prefix-caching \
- --enable-chunked-prefill \
- --max-num-batched-tokens 16384 \
- --dtype auto \
- --kv-cache-dtype fp8
- ```
 
  ```
- Environment:
-
- Python: 3.12.3
- vLLM: 0.11.2.dev360+g8e7a89160
- PyTorch: 2.9.0+cu130
- CUDA: 13.0
- GPU 0: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
- GPU 1: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
- Triton: 3.5.0
- FlashInfer: 0.5.3
  ```
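A quick sanity check for a server started with the removed command above (it binds port 8345 and registers both `default-model` and `lukealonso/MiniMax-M2-NVFP4` as served names) is to list the models. A minimal sketch, assuming you are on a vLLM build where that command still works and the server is reachable locally:

```bash
# Both aliases passed to --served-model-name should appear in the model list.
curl -s http://localhost:8345/v1/models
```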
 
  Tested (but not extensively validated) on *2x* RTX Pro 6000 Blackwell via:
  (note that these instructions no longer work due to nightly vLLM breaking NVFP4 support)
 
  Instructions from another user, running on RTX Pro 6000 Blackwell:
 
+ Running this model on vllm/vllm-openai:nightly has given mixed results: sometimes it works, sometimes it does not.
+
+ I could not run this model with the stock vllm v0.12.0 release because it was built against CUDA 12.9.
+ Other discussions on this model show it working with vLLM v0.12.0 when using CUDA 13.
+
+ I stepped through the following instructions to reliably build vLLM and run this model with vLLM 0.12.0 and CUDA 13.0.2.
+
+ # Instructions
+ ## 1. Build VLLM Image
+
+ ```bash
+ # Clone the vLLM repo
+ if [[ ! -d vllm ]]; then
+     git clone https://github.com/vllm-project/vllm.git
+ fi
+
+ # Check out the v0.12.0 release of vLLM
+ cd vllm
+ git checkout releases/v0.12.0
+
+ # Build with CUDA 13.0.2, the precompiled vllm cu130 wheel, and an Ubuntu 22.04 base image
+ DOCKER_BUILDKIT=1 \
+ docker build . \
+     --target vllm-openai \
+     --tag vllm/vllm-openai:custom-vllm-0.12.0-cuda-13.0.2-py-3.12 \
+     --file docker/Dockerfile \
+     --build-arg max_jobs=64 \
+     --build-arg nvcc_threads=16 \
+     --build-arg CUDA_VERSION=13.0.2 \
+     --build-arg PYTHON_VERSION=3.12 \
+     --build-arg VLLM_USE_PRECOMPILED=true \
+     --build-arg VLLM_MAIN_CUDA_VERSION=130 \
+     --build-arg BUILD_BASE_IMAGE=nvidia/cuda:13.0.2-devel-ubuntu22.04 \
+     --build-arg RUN_WHEEL_CHECK=false \
+     ;
+ # Ubuntu 22.04 is required because there is no 20.04 variant of the CUDA 13.0.2 base image
  ```
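Before moving on, it can be worth confirming that the image actually contains the expected vLLM and CUDA combination. A minimal sketch, assuming the stock `vllm-openai` image layout (the `--entrypoint` override to plain `python3` is the only thing added here):

```bash
# Print the vLLM version and the CUDA version PyTorch was built against.
docker run --rm --entrypoint python3 \
  vllm/vllm-openai:custom-vllm-0.12.0-cuda-13.0.2-py-3.12 \
  -c "import vllm, torch; print(vllm.__version__, torch.version.cuda)"
```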
+
+ ## 2. Run the custom VLLM Image
+
+ ```yaml
+ services:
+   vllm:
+
+     # !!! Notice !!! this is the custom image built in step 1 above
+     image: vllm/vllm-openai:custom-vllm-0.12.0-cuda-13.0.2-py-3.12
+
+     environment:
+       # Optional
+       VLLM_NO_USAGE_STATS: "1"
+       DO_NOT_TRACK: "1"
+       CUDA_DEVICE_ORDER: PCI_BUS_ID
+       VLLM_LOGGING_LEVEL: INFO
+
+       # Required (I think)
+       VLLM_ATTENTION_BACKEND: FLASHINFER
+       VLLM_FLASHINFER_MOE_BACKEND: throughput
+       VLLM_USE_FLASHINFER_MOE_FP16: 1
+       VLLM_USE_FLASHINFER_MOE_FP8: 1
+       VLLM_USE_FLASHINFER_MOE_FP4: 1
+       VLLM_USE_FLASHINFER_MOE_MXFP4_BF16: 1
+       VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8_CUTLASS: 1
+       VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8: 1
+
+       # Required (I think)
+       VLLM_WORKER_MULTIPROC_METHOD: spawn
+       NCCL_P2P_DISABLE: 1
+
+     entrypoint: /bin/bash
+     command:
+       - -c
+       - |
+         vllm serve \
+           /root/.cache/huggingface/hub/models--lukealonso--MiniMax-M2-NVFP4/snapshots/d8993b15556ab7294530f1ba50a93ad130166174/ \
+           --served-model-name minimax-m2-fp4 \
+           --gpu-memory-utilization 0.95 \
+           --pipeline-parallel-size 1 \
+           --enable-expert-parallel \
+           --tensor-parallel-size 4 \
+           --max-model-len $(( 192 * 1024 )) \
+           --max-num-seqs 32 \
+           --enable-auto-tool-choice \
+           --reasoning-parser minimax_m2_append_think \
+           --tool-call-parser minimax_m2 \
+           --all2all-backend pplx \
+           --enable-prefix-caching \
+           --enable-chunked-prefill \
+           --max-num-batched-tokens $(( 64 * 1024 )) \
+           --dtype auto \
+           --kv-cache-dtype fp8 \
+           ;
+
  ```
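With the compose file saved (e.g. as `docker-compose.yml`), bringing the service up and watching it load looks roughly like the sketch below. Note that the snippet above does not show a `ports:` mapping or a GPU reservation for the container, so you will likely need to add those for your setup:

```bash
# Start the vllm service in the background, then follow its logs until the
# model finishes loading and the OpenAI-compatible server reports it is ready.
docker compose up -d vllm
docker compose logs -f vllm
```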
 
+ # VLLM CLI Arguments Explained
+
+ ## Required Model Arguments
+
+ 1. The path to the model (alternatively the Hugging Face ID `lukealonso/MiniMax-M2-NVFP4`)
+    - `/root/.cache/huggingface/hub/models--lukealonso--MiniMax-M2-NVFP4/snapshots/d8993b15556ab7294530f1ba50a93ad130166174/`
+ 2. The minimax-m2 reasoning/tool-call parsers and tool config (see the request sketch below)
+    - `--reasoning-parser minimax_m2_append_think`
+    - `--tool-call-parser minimax_m2`
+    - `--enable-auto-tool-choice`
+
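The tool-calling flags above are what allow OpenAI-style `tools` requests to work against this server. A minimal request sketch, assuming the server is published on localhost:8000 (not shown in the compose file) and using a made-up `get_weather` tool purely for illustration:

```bash
# Hypothetical tool-calling request; the model should answer with a tool call.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "minimax-m2-fp4",
    "messages": [{"role": "user", "content": "What is the weather in Paris right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'
```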
+ ## Required Compute Arguments
+ 1. The parallelism mode (multi-GPU, 4x in this example; see the sketch after this list)
+    - `--enable-expert-parallel`
+    - `--pipeline-parallel-size 1`
+    - `--tensor-parallel-size 4`
+    - `--all2all-backend pplx`
+ 2. The KV cache & layer data types
+    - `--kv-cache-dtype fp8`
+    - `--dtype auto`
+
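Since this card mentions both 2x and 4x GPU setups, one convenience (not part of the original instructions) is deriving the tensor-parallel size from however many GPUs are visible instead of hard-coding it. A rough sketch reusing the parallelism flags above:

```bash
# Hypothetical helper: match --tensor-parallel-size to the visible GPU count.
TP_SIZE=$(nvidia-smi --list-gpus | wc -l)

vllm serve lukealonso/MiniMax-M2-NVFP4 \
  --tensor-parallel-size "$TP_SIZE" \
  --pipeline-parallel-size 1 \
  --enable-expert-parallel \
  --all2all-backend pplx \
  --dtype auto \
  --kv-cache-dtype fp8
```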
+ ## Optional Model Arguments
+ 1. The name the model is presented under to API clients
+    - `--served-model-name minimax-m2-fp4`
+ 2. The context size available to the model (192k max; see the expansion below)
+    - `--max-model-len $(( 192 * 1024 ))`
+ 3. The prompt chunking size (faster time-to-first-token with large prompts)
+    - `--enable-chunked-prefill`
+    - `--max-num-batched-tokens $(( 64 * 1024 ))`
+
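For reference, the shell arithmetic in these flags just expands to plain token counts; the removed command at the top of this card passed the same context length literally:

```bash
echo $(( 192 * 1024 ))   # 196608 -> --max-model-len
echo $(( 64 * 1024 ))    # 65536  -> --max-num-batched-tokens
```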
+ ## Optional Performance Arguments
+ 1. How much GPU memory vLLM may claim per device
+    - `--gpu-memory-utilization 0.95`
+ 2. How many requests the server will process concurrently
+    - `--max-num-seqs 32`
+ 3. Allow KV cache sharing for overlapping prompts (the metrics sketch below is one way to watch this)
+    - `--enable-prefix-caching`
+
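One way to sanity-check these settings at runtime is vLLM's Prometheus endpoint, served on the same port as the API. A rough sketch; metric names vary somewhat between vLLM versions, and localhost:8000 is again an assumption:

```bash
# Peek at scheduler and prefix-cache metrics exposed by the running server.
curl -s http://localhost:8000/metrics | grep -E 'num_requests|prefix_cache|cache_usage'
```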
 
  Tested (but not extensively validated) on *2x* RTX Pro 6000 Blackwell via:
  (note that these instructions no longer work due to nightly vLLM breaking NVFP4 support)