MiniMax-M2.1-NVFP4

Format: NVFP4 — optimal partial quantization of weights & activations to NVFP4.
Base model: MiniMax-M2.1-NVFP4
How it was made: AutoQuantized with NVIDIA Model-Optimizer (NVFP4), using the default calibration mix. (cnn_dailymail and nemotron-post-training-dataset-v2)

Check the original model card for information about this model.


sglang Inference Note:

vim /sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py

change the code in 1517 line like this:

        ), f"Expected {name}_weight_scale.dim(2) == {expected_blocks[name]}, got {weight_scale.shape[-1]}"
    else:
        pass
        # For other backends, ensure the per-input block dimension is aligned to 16.
        #assert (
        #    weight_scale.shape[assert_dim] % block_size == 0
        #), f"Expected {name}_weight_scale.dim({assert_dim}) to be divisible by {block_size}"

deploy command MiniMax-M2.1-NVFP4 on sglang:

python3 -m sglang.launch_server --model-path  MiniMax-M2.1-NVFP4/   --quantization modelopt_fp4  --tp 8 --attention-backend flashinfer  --trust-remote-code

perf

We performed deployment on 8x 5090, and the stress test performance data is provided below.


Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   10.7085 |
+-----------------------------------+-----------+
| Number of concurrency             |    1      |
+-----------------------------------+-----------+
| Request rate (req/s)              |   -1      |
+-----------------------------------+-----------+
| Total requests                    |    1      |
+-----------------------------------+-----------+
| Succeed requests                  |    1      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |   47.8126 |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    |  143.438  |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.0934 |
+-----------------------------------+-----------+
| Average latency (s)               |   10.7085 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    0.5682 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.0198 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    0.0202 |
+-----------------------------------+-----------+
| Average input tokens per request  | 1024      |
+-----------------------------------+-----------+
| Average output tokens per request |  512      |
+-----------------------------------+-----------+
2026-01-07 04:00:24 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  0.5682  | 0.0196  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     25%     |  0.5682  | 0.0197  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     50%     |  0.5682  | 0.0198  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     66%     |  0.5682  | 0.0199  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     75%     |  0.5682  | 0.0199  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     80%     |  0.5682  | 0.0199  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     90%     |  0.5682  | 0.0201  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     95%     |  0.5682  | 0.0204  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     98%     |  0.5682  | 0.0393  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
|     99%     |  0.5682  | 0.0396  |  0.0198  |   10.7085   |     1024     |      512      |    47.8126     |   143.4379    |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   24.0981 |
+-----------------------------------+-----------+
| Number of concurrency             |   16      |
+-----------------------------------+-----------+
| Request rate (req/s)              |   -1      |
+-----------------------------------+-----------+
| Total requests                    |   16      |
+-----------------------------------+-----------+
| Succeed requests                  |   16      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |  339.944  |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 1019.83   |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.664  |
+-----------------------------------+-----------+
| Average latency (s)               |   24.0845 |
+-----------------------------------+-----------+
| Average time to first token (s)   |    5.7343 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.0359 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    0.0358 |
+-----------------------------------+-----------+
| Average input tokens per request  | 1024      |
+-----------------------------------+-----------+
| Average output tokens per request |  512      |
+-----------------------------------+-----------+
2026-01-07 04:11:34 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  2.4771  | 0.0275  |  0.0301  |   24.0913   |     1024     |      512      |    21.2486     |    63.7458    |
|     25%     |  3.6108  |  0.028  |  0.0313  |   24.0928   |     1024     |      512      |    21.2493     |    63.7479    |
|     50%     |  5.8605  | 0.0284  |  0.0357  |   24.0939   |     1024     |      512      |    21.2507     |    63.7521    |
|     66%     |  6.985   | 0.0287  |  0.0379  |   24.094    |     1024     |      512      |     21.251     |    63.753     |
|     75%     |   8.11   | 0.0289  |  0.0401  |   24.095    |     1024     |      512      |     21.252     |    63.7559    |
|     80%     |   8.11   |  0.029  |  0.0401  |   24.095    |     1024     |      512      |     21.252     |    63.7559    |
|     90%     |  8.7294  | 0.0295  |  0.0423  |   24.0957   |     1024     |      512      |    21.2525     |    63.7576    |
|     95%     |  9.3849  | 0.0298  |  0.0445  |   24.0971   |     1024     |      512      |    21.3819     |    64.1458    |
|     98%     |  9.3849  | 0.0308  |  0.0445  |   24.0971   |     1024     |      512      |    21.3819     |    64.1458    |
|     99%     |  9.3849  | 0.0328  |  0.0445  |   24.0971   |     1024     |      512      |    21.3819     |    64.1458    |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+

Benchmarking summary:
+-----------------------------------+-----------+
| Key                               |     Value |
+===================================+===========+
| Time taken for tests (s)          |   35.9271 |
+-----------------------------------+-----------+
| Number of concurrency             |   32      |
+-----------------------------------+-----------+
| Request rate (req/s)              |   -1      |
+-----------------------------------+-----------+
| Total requests                    |   32      |
+-----------------------------------+-----------+
| Succeed requests                  |   32      |
+-----------------------------------+-----------+
| Failed requests                   |    0      |
+-----------------------------------+-----------+
| Output token throughput (tok/s)   |  456.034  |
+-----------------------------------+-----------+
| Total token throughput (tok/s)    | 1368.1    |
+-----------------------------------+-----------+
| Request throughput (req/s)        |    0.8907 |
+-----------------------------------+-----------+
| Average latency (s)               |   35.914  |
+-----------------------------------+-----------+
| Average time to first token (s)   |   10.2324 |
+-----------------------------------+-----------+
| Average time per output token (s) |    0.0503 |
+-----------------------------------+-----------+
| Average inter-token latency (s)   |    0.0501 |
+-----------------------------------+-----------+
| Average input tokens per request  | 1024      |
+-----------------------------------+-----------+
| Average output tokens per request |  512      |
+-----------------------------------+-----------+
2026-01-07 04:14:20 - evalscope - INFO: 
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
|     10%     |  3.5765  | 0.0338  |  0.0369  |   35.913    |     1024     |      512      |    14.2526     |    42.7577    |
|     25%     |  5.8246  | 0.0351  |  0.0413  |   35.9153   |     1024     |      512      |     14.254     |    42.7621    |
|     50%     | 10.3224  | 0.0356  |  0.0501  |   35.9185   |     1024     |      512      |    14.2545     |    42.7636    |
|     66%     | 13.6935  | 0.0359  |  0.0564  |   35.9196   |     1024     |      512      |    14.2556     |    42.7667    |
|     75%     | 14.8185  | 0.0362  |  0.0589  |   35.9198   |     1024     |      512      |    14.2561     |    42.7682    |
|     80%     | 15.9416  | 0.0365  |  0.0611  |   35.9199   |     1024     |      512      |    14.2561     |    42.7683    |
|     90%     | 17.0672  |  0.037  |  0.0633  |   35.9233   |     1024     |      512      |    14.2567     |     42.77     |
|     95%     | 17.6148  | 0.0373  |  0.0655  |   35.9244   |     1024     |      512      |    14.2573     |    42.7718    |
|     98%     | 17.6331  | 0.0379  |  0.0677  |   35.9256   |     1024     |      512      |    14.3066     |    42.9197    |
|     99%     | 17.6331  | 0.0394  |  0.0677  |   35.9256   |     1024     |      512      |    14.3066     |    42.9197    |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
Downloads last month
14
Safetensors
Model size
115B params
Tensor type
BF16
·
F32
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Tengyunw/MiniMax-M2.1-NVFP4

Quantized
(33)
this model

Datasets used to train Tengyunw/MiniMax-M2.1-NVFP4