---
base_model:
- MiniMaxAI/MiniMax-M2
---

MiniMax-M2 quantized to NVFP4 with NVIDIA Model Optimizer (modelopt).

Tested (but not extensively validated) on *2x* RTX Pro 6000 Blackwell with the following Docker Compose service definition:
|
|
```yaml
inference:
  image: vllm/vllm-openai:nightly-96142f209453a381fcaf9d9d010bbf8711119a77
  container_name: inference
  ports:
    - "0.0.0.0:8000:8000"
  gpus: "all"
  shm_size: "32g"
  ipc: "host"
  ulimits:
    memlock: -1
    nofile: 1048576
  environment:
    - NCCL_IB_DISABLE=1
    - NCCL_NVLS_ENABLE=0
    - NCCL_P2P_DISABLE=0
    - NCCL_SHM_DISABLE=0
    - VLLM_USE_V1=1
    - VLLM_USE_FLASHINFER_MOE_FP4=1
    - OMP_NUM_THREADS=8
    - SAFETENSORS_FAST_GPU=1
  volumes:
    - /dev/shm:/dev/shm
  command:
    - lukealonso/MiniMax-M2-NVFP4
    - --enable-auto-tool-choice
    - --tool-call-parser
    - minimax_m2
    - --reasoning-parser
    - minimax_m2_append_think
    - --all2all-backend
    - pplx
    - --enable-expert-parallel
    - --enable-prefix-caching
    - --enable-chunked-prefill
    - --served-model-name
    - "MiniMax-M2"
    - --tensor-parallel-size
    - "2"
    - --gpu-memory-utilization
    - "0.95"
    - --max-num-batched-tokens
    - "16384"
    - --dtype
    - "auto"
    - --max-num-seqs
    - "8"
    - --kv-cache-dtype
    - fp8
    - --host
    - "0.0.0.0"
    - --port
    - "8000"
```
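
Once the container is up, vLLM exposes an OpenAI-compatible API on the mapped port. A minimal sketch of a chat-completion request using only the standard library, assuming the service is reachable at `localhost:8000` and that `--served-model-name` is `MiniMax-M2` as configured above:

```python
import json
import urllib.request

# Base URL assumes the port mapping from the compose file above.
BASE_URL = "http://localhost:8000/v1"

# The "model" field must match the --served-model-name flag.
payload = {
    "model": "MiniMax-M2",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment to send the request against a running container:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```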