---
base_model:
- MiniMaxAI/MiniMax-M2
---

MiniMax-M2 quantized to NVFP4 with NVIDIA Model Optimizer (modelopt).

Tested (but not extensively validated) on *2x* RTX Pro 6000 Blackwell with the following Docker Compose service definition:
|
|
```yaml
inference:
  image: vllm/vllm-openai:nightly-96142f209453a381fcaf9d9d010bbf8711119a77
  container_name: inference
  ports:
    - "0.0.0.0:8000:8000"
  gpus: "all"
  shm_size: "32g"
  ipc: "host"
  ulimits:
    memlock: -1
    nofile: 1048576
  environment:
    - NCCL_IB_DISABLE=1
    - NCCL_NVLS_ENABLE=0
    - NCCL_P2P_DISABLE=0
    - NCCL_SHM_DISABLE=0
    - VLLM_USE_V1=1
    - VLLM_USE_FLASHINFER_MOE_FP4=1
    - OMP_NUM_THREADS=8
    - SAFETENSORS_FAST_GPU=1
  volumes:
    - /dev/shm:/dev/shm
  command:
    - lukealonso/MiniMax-M2-NVFP4
    - --enable-auto-tool-choice
    - --tool-call-parser
    - minimax_m2
    - --reasoning-parser
    - minimax_m2_append_think
    - --all2all-backend
    - pplx
    - --enable-expert-parallel
    - --enable-prefix-caching
    - --enable-chunked-prefill
    - --served-model-name
    - "MiniMax-M2"
    - --tensor-parallel-size
    - "2"
    - --gpu-memory-utilization
    - "0.95"
    - --max-num-batched-tokens
    - "16384"
    - --dtype
    - "auto"
    - --max-num-seqs
    - "8"
    - --kv-cache-dtype
    - fp8
    - --host
    - "0.0.0.0"
    - --port
    - "8000"
```
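
Once the container is up, vLLM exposes an OpenAI-compatible API on the mapped port. A minimal sketch of a chat-completion request using only the standard library, assuming the service is reachable at `localhost:8000` and that `--served-model-name` is `MiniMax-M2` as configured above:

```python
import json
import urllib.request

# Base URL assumes the port mapping from the compose file above.
BASE_URL = "http://localhost:8000/v1"

# The "model" field must match the --served-model-name flag.
payload = {
    "model": "MiniMax-M2",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment to send the request against a running container:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```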