Docker Image

#1
by mtcl - opened

I am unable to build vLLM with your script. Would you happen to have a Docker image that I can use?

Canada Quant Labs org

for rtx?

For NVIDIA RTX PRO 6000 Blackwell GPUs. I have 2x Workstation Edition cards with 192 GB of VRAM in total.

Canada Quant Labs org
This comment has been hidden
Canada Quant Labs org
This comment has been hidden
Canada Quant Labs org

Disclosure: this comment was generated with AI assistance.

Hey @mtcl — sorry, my earlier replies in this thread were misdirected (now hidden). For your 2× RTX PRO 6000 Blackwell (Workstation Edition, SM 12.0a) setup, the right artifact is the W4A16 sibling, not this NVFP4 card. On consumer/server Blackwell the W4A16 routed experts hit Marlin's native INT4 path (mature, well-tuned cubins), whereas NVFP4 falls back to an FP4-adapted-for-Marlin path that's slower and less stable on SM 12.0a.
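Before pulling anything, it may be worth confirming that both cards report compute capability 12.0. The following is a minimal sketch, assuming a driver recent enough for `nvidia-smi` to support the `compute_cap` query field; `is_sm120` is a hypothetical helper name and not part of the repo.

```shell
#!/usr/bin/env bash
# is_sm120 (hypothetical helper): read "name, compute_cap" CSV lines on stdin.
# Succeeds only if at least one GPU is listed and every GPU reports 12.0.
is_sm120() {
    local seen=0 name cap
    while IFS=, read -r name cap || [ -n "$name" ]; do
        [ -z "$name" ] && continue      # skip blank lines
        cap="${cap// /}"                # strip the space after the comma
        seen=1
        [ "$cap" = "12.0" ] || return 1
    done
    [ "$seen" -eq 1 ]
}

# Usage on the host (requires the NVIDIA driver):
#   nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader | is_sm120 \
#       && echo "SM 12.0 detected" || echo "not SM 12.0"
```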

We just published a pre-built Docker image specifically for RTX PRO 6000 — bakes the full 13-layer recipe (jasl/vllm@27fd665b + canada-quant BF16 MTP cherry-pick + Marlin MoE c_tmp / workspace 4× patches + cute.arch.fmin shim). No vLLM source build needed.

Model: https://huggingface.co/canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP
Docker image (HF dataset): https://huggingface.co/datasets/canada-quant/dsv4-flash-w4a16-rtxpro6000-image
Repo with the parameterized serve + bench scripts: https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp

Quickstart for your 2× RTX PRO 6000

# 1) Pull the Docker image tarball (~14 GB compressed)
pip install --user --quiet "huggingface_hub>=1.16"
export PATH="$HOME/.local/bin:$PATH"
hf download canada-quant/dsv4-flash-w4a16-rtxpro6000-image \
    --repo-type dataset --include "*.tar.gz" --local-dir .
docker load < dsv4-w4a16-rtxpro6000-v1.tar.gz

# 2) Cache the W4A16 model (~159 GB; put it on a fast NVMe)
HF_HOME=/path/to/big/nvme/hf-cache hf download \
    canada-quant/DeepSeek-V4-Flash-W4A16-FP8-MTP

# 3) Pull the serve helper
git clone https://github.com/canada-quant/dsv4-flash-w4a16-fp8-mtp.git
cd dsv4-flash-w4a16-fp8-mtp

# 4) Serve on both GPUs, TP=2. The values below match our benchmarked configuration
#    (max_num_seqs=4, max_model_len=65536). For best throughput use max_num_seqs=8 at
#    16K-32K context; for long context use max_num_seqs=1 (we've verified ~256K with
#    MTP on at TP=2).
docker run -d --gpus all --name dsv4-w4a16-serve \
    --shm-size=16g --ipc=host -p 8000:8000 \
    -v /path/to/big/nvme/hf-cache:/root/.cache/huggingface \
    -v "$(pwd)/scripts:/workspace/scripts:ro" \
    -e TP=2 -e MAX_NUM_SEQS=4 -e MAX_MODEL_LEN=65536 -e GPU_MEM_UTIL=0.95 \
    canada-quant/dsv4-w4a16-rtxpro6000:v1 \
    bash /workspace/scripts/serve_rtx6000pro_w4a16.sh

# 5) Wait for /v1/models (~3-5 min for model load + cudagraph capture)
until curl -sf http://127.0.0.1:8000/v1/models >/dev/null; do sleep 5; done

# 6) Smoke test
curl -sX POST http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"DSV4-W4A16-FP8-MTP",
         "messages":[{"role":"user","content":"What is 17*23?"}],
         "max_tokens":60,"temperature":0}'

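The `until` loop in step 5 spins forever if the container exits during startup. Here is a minimal sketch of a bounded wait. `wait_until` is a hypothetical helper, and the 600-second budget in the usage comment is an assumption chosen to exceed the ~3-5 min load time quoted above.

```shell
#!/usr/bin/env bash
# wait_until TIMEOUT_S INTERVAL_S CMD... (hypothetical helper):
# re-run CMD until it succeeds or TIMEOUT_S seconds have elapsed.
# Returns 0 on success, 1 on timeout.
wait_until() {
    local timeout="$1" interval="$2"
    shift 2
    local deadline=$(( SECONDS + timeout ))
    until "$@"; do
        if [ "$SECONDS" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
    return 0
}

# Usage: poll the endpoint, and dump recent container logs if it never comes up.
#   wait_until 600 5 curl -sf -o /dev/null http://127.0.0.1:8000/v1/models \
#       || docker logs --tail 50 dsv4-w4a16-serve
```

If the wait times out, `docker ps -a` should show whether the container is still running or has already exited.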
What you should see

We verified this image end-to-end on 2× RTX PRO 6000 (Server Edition, same SM 12.0a as your Workstation cards) in a fresh Docker container today:

| Bench (TP=2, max_model_len=65536, max_num_seqs=4) | Result |
| --- | --- |
| AIME-2024 c=4 thinking-high | 27/30 correct, MTP acceptance 91.97%, 1 length-truncation, 9161s wall |
| AIME-2024 c=4 no-think (chat) | 18/30 correct, MTP 95.78%, 3175s wall |
| AIME-2024 c=4 thinking-max | in flight, will edit when done |

(The 27/30 at thinking-high beats our earlier published 24/30 because we set max_tokens = max_model_len - 500 so reasoning runs to natural stop instead of being capped at 16K — 3 of the previously-truncated problems resolve correctly.)
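The token budget described above is simple arithmetic. As a minimal sketch, assuming the server runs with MAX_MODEL_LEN=65536 as in the quickstart (`completion_budget` is a hypothetical helper; the 500-token margin is the one quoted above):

```shell
#!/usr/bin/env bash
# completion_budget MAX_MODEL_LEN [MARGIN] (hypothetical helper):
# print the max_tokens value that leaves MARGIN tokens of headroom.
completion_budget() {
    local max_model_len="$1" margin="${2:-500}"
    echo $(( max_model_len - margin ))
}

MAX_TOKENS="$(completion_budget 65536)"   # 65536 - 500 = 65036
echo "max_tokens=${MAX_TOKENS}"
```

The resulting value goes into the `max_tokens` field of the request body. The margin presumably leaves room for the prompt, so a prompt longer than the margin would need a larger one.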

Key env vars baked into the image (you don't need to set them):

  • VLLM_TEST_FORCE_FP8_MARLIN=1 — routes attention block-FP8 layers through Marlin (the only working SM 12.0a path)
  • VLLM_USE_LAYERNAME=0 — avoids the Inductor MoE FakeScriptObject crash WITHOUT needing --enforce-eager, so CUDA graphs stay enabled (~1.5× decode speedup retained)
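To confirm those variables in a running container, something like the following may help. This sketch assumes the variables are visible in the container's environment; the post only says they are baked into the image, so they could instead be exported by the serve script, in which case the check would fail even on a healthy setup. `has_env` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# has_env NAME VALUE (hypothetical helper): read KEY=VALUE lines on stdin
# and succeed only if the exact line NAME=VALUE is present.
has_env() {
    grep -qxF -- "$1=$2"
}

# Usage against the container started in the quickstart:
#   docker exec dsv4-w4a16-serve env | has_env VLLM_TEST_FORCE_FP8_MARLIN 1
#   docker exec dsv4-w4a16-serve env | has_env VLLM_USE_LAYERNAME 0
```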

Hit any snag, drop a follow-up here and tag me. Happy serving 🚀

@pastapaul has the state of b12x integration into vLLM gotten any better, enough to try running that code path? I've built my own nvfp4.py patch for RTX 6000s, but it gets caught at multiple steps (cutedsl b12x integration is a bit like choosing versions at random).
