MiniMax-M2.1-NVFP4
Format: NVFP4 — optimal partial quantization of weights & activations to NVFP4.
Base model: MiniMax-M2.1-NVFP4
How it was made: AutoQuantized with NVIDIA Model-Optimizer (NVFP4), using the default calibration mix. (cnn_dailymail and nemotron-post-training-dataset-v2)
Check the original model card for information about this model.
sglang Inference Note:
vim /sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py
change the code in 1517 line like this:
), f"Expected {name}_weight_scale.dim(2) == {expected_blocks[name]}, got {weight_scale.shape[-1]}"
else:
pass
# For other backends, ensure the per-input block dimension is aligned to 16.
#assert (
# weight_scale.shape[assert_dim] % block_size == 0
#), f"Expected {name}_weight_scale.dim({assert_dim}) to be divisible by {block_size}"
deploy command MiniMax-M2.1-NVFP4 on sglang:
python3 -m sglang.launch_server --model-path MiniMax-M2.1-NVFP4/ --quantization modelopt_fp4 --tp 8 --attention-backend flashinfer --trust-remote-code
perf
We performed deployment on 8x 5090, and the stress test performance data is provided below.
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 10.7085 |
+-----------------------------------+-----------+
| Number of concurrency | 1 |
+-----------------------------------+-----------+
| Request rate (req/s) | -1 |
+-----------------------------------+-----------+
| Total requests | 1 |
+-----------------------------------+-----------+
| Succeed requests | 1 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 47.8126 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 143.438 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 0.0934 |
+-----------------------------------+-----------+
| Average latency (s) | 10.7085 |
+-----------------------------------+-----------+
| Average time to first token (s) | 0.5682 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0198 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0202 |
+-----------------------------------+-----------+
| Average input tokens per request | 1024 |
+-----------------------------------+-----------+
| Average output tokens per request | 512 |
+-----------------------------------+-----------+
2026-01-07 04:00:24 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.5682 | 0.0196 | 0.0198 | 10.7085 | 1024 | 512 | 47.8126 | 143.4379 |
| 25% | 0.5682 | 0.0197 | 0.0198 | 10.7085 | 1024 | 512 | 47.8126 | 143.4379 |
| 50% | 0.5682 | 0.0198 | 0.0198 | 10.7085 | 1024 | 512 | 47.8126 | 143.4379 |
| 66% | 0.5682 | 0.0199 | 0.0198 | 10.7085 | 1024 | 512 | 47.8126 | 143.4379 |
| 75% | 0.5682 | 0.0199 | 0.0198 | 10.7085 | 1024 | 512 | 47.8126 | 143.4379 |
| 80% | 0.5682 | 0.0199 | 0.0198 | 10.7085 | 1024 | 512 | 47.8126 | 143.4379 |
| 90% | 0.5682 | 0.0201 | 0.0198 | 10.7085 | 1024 | 512 | 47.8126 | 143.4379 |
| 95% | 0.5682 | 0.0204 | 0.0198 | 10.7085 | 1024 | 512 | 47.8126 | 143.4379 |
| 98% | 0.5682 | 0.0393 | 0.0198 | 10.7085 | 1024 | 512 | 47.8126 | 143.4379 |
| 99% | 0.5682 | 0.0396 | 0.0198 | 10.7085 | 1024 | 512 | 47.8126 | 143.4379 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 24.0981 |
+-----------------------------------+-----------+
| Number of concurrency | 16 |
+-----------------------------------+-----------+
| Request rate (req/s) | -1 |
+-----------------------------------+-----------+
| Total requests | 16 |
+-----------------------------------+-----------+
| Succeed requests | 16 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 339.944 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1019.83 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 0.664 |
+-----------------------------------+-----------+
| Average latency (s) | 24.0845 |
+-----------------------------------+-----------+
| Average time to first token (s) | 5.7343 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0359 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0358 |
+-----------------------------------+-----------+
| Average input tokens per request | 1024 |
+-----------------------------------+-----------+
| Average output tokens per request | 512 |
+-----------------------------------+-----------+
2026-01-07 04:11:34 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 2.4771 | 0.0275 | 0.0301 | 24.0913 | 1024 | 512 | 21.2486 | 63.7458 |
| 25% | 3.6108 | 0.028 | 0.0313 | 24.0928 | 1024 | 512 | 21.2493 | 63.7479 |
| 50% | 5.8605 | 0.0284 | 0.0357 | 24.0939 | 1024 | 512 | 21.2507 | 63.7521 |
| 66% | 6.985 | 0.0287 | 0.0379 | 24.094 | 1024 | 512 | 21.251 | 63.753 |
| 75% | 8.11 | 0.0289 | 0.0401 | 24.095 | 1024 | 512 | 21.252 | 63.7559 |
| 80% | 8.11 | 0.029 | 0.0401 | 24.095 | 1024 | 512 | 21.252 | 63.7559 |
| 90% | 8.7294 | 0.0295 | 0.0423 | 24.0957 | 1024 | 512 | 21.2525 | 63.7576 |
| 95% | 9.3849 | 0.0298 | 0.0445 | 24.0971 | 1024 | 512 | 21.3819 | 64.1458 |
| 98% | 9.3849 | 0.0308 | 0.0445 | 24.0971 | 1024 | 512 | 21.3819 | 64.1458 |
| 99% | 9.3849 | 0.0328 | 0.0445 | 24.0971 | 1024 | 512 | 21.3819 | 64.1458 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 35.9271 |
+-----------------------------------+-----------+
| Number of concurrency | 32 |
+-----------------------------------+-----------+
| Request rate (req/s) | -1 |
+-----------------------------------+-----------+
| Total requests | 32 |
+-----------------------------------+-----------+
| Succeed requests | 32 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 456.034 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1368.1 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 0.8907 |
+-----------------------------------+-----------+
| Average latency (s) | 35.914 |
+-----------------------------------+-----------+
| Average time to first token (s) | 10.2324 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0503 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0501 |
+-----------------------------------+-----------+
| Average input tokens per request | 1024 |
+-----------------------------------+-----------+
| Average output tokens per request | 512 |
+-----------------------------------+-----------+
2026-01-07 04:14:20 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 3.5765 | 0.0338 | 0.0369 | 35.913 | 1024 | 512 | 14.2526 | 42.7577 |
| 25% | 5.8246 | 0.0351 | 0.0413 | 35.9153 | 1024 | 512 | 14.254 | 42.7621 |
| 50% | 10.3224 | 0.0356 | 0.0501 | 35.9185 | 1024 | 512 | 14.2545 | 42.7636 |
| 66% | 13.6935 | 0.0359 | 0.0564 | 35.9196 | 1024 | 512 | 14.2556 | 42.7667 |
| 75% | 14.8185 | 0.0362 | 0.0589 | 35.9198 | 1024 | 512 | 14.2561 | 42.7682 |
| 80% | 15.9416 | 0.0365 | 0.0611 | 35.9199 | 1024 | 512 | 14.2561 | 42.7683 |
| 90% | 17.0672 | 0.037 | 0.0633 | 35.9233 | 1024 | 512 | 14.2567 | 42.77 |
| 95% | 17.6148 | 0.0373 | 0.0655 | 35.9244 | 1024 | 512 | 14.2573 | 42.7718 |
| 98% | 17.6331 | 0.0379 | 0.0677 | 35.9256 | 1024 | 512 | 14.3066 | 42.9197 |
| 99% | 17.6331 | 0.0394 | 0.0677 | 35.9256 | 1024 | 512 | 14.3066 | 42.9197 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
- Downloads last month
- 14
Model tree for Tengyunw/MiniMax-M2.1-NVFP4
Base model
MiniMaxAI/MiniMax-M2.1