Does FP8 work for the base model, or is the 16-bit 27B required?
Running vLLM with DFlash on the FP8 quant of the 27B model, num_speculative_tokens=15 averages a very low acceptance rate of ~12%. With 8 it is around 25-30%. Performance at 8 is on par with MTP=3.
I believe this draft model can also be used with Qwen3.5-27B-FP8. I benchmarked it with both the BF16 target model and the FP8 target model on HumanEval, and the acceptance lengths are very close.
Here are the Qwen3.5-27B results on vLLM.
Successful requests: 164
Failed requests: 0
Maximum request concurrency: 1
Benchmark duration (s): 389.19
Total input tokens: 24600
Total generated tokens: 165775
Request throughput (req/s): 0.42
Output token throughput (tok/s): 425.95
Peak output token throughput (tok/s): 57.00
Peak concurrent requests: 3.00
Total token throughput (tok/s): 489.16
---------------Time to First Token----------------
Mean TTFT (ms): 66.36
Median TTFT (ms): 65.70
P99 TTFT (ms): 84.54
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 2.27
Median TPOT (ms): 2.18
P99 TPOT (ms): 3.68
---------------Inter-token Latency----------------
Mean ITL (ms): 18.30
Median ITL (ms): 18.35
P99 ITL (ms): 20.29
---------------Speculative Decoding---------------
Acceptance rate (%): 47.24
Acceptance length: 8.09
Drafts: 20503
Draft tokens: 307545
Accepted tokens: 145292
Per-position acceptance (%):
Position 0: 92.54
Position 1: 82.47
Position 2: 72.99
Position 3: 64.77
Position 4: 57.66
Position 5: 51.52
Position 6: 46.24
Position 7: 41.87
Position 8: 37.78
Position 9: 34.20
Position 10: 31.07
Position 11: 28.10
Position 12: 25.33
Position 13: 22.50
Position 14: 19.59
Here are the Qwen3.5-27B-FP8 results:
Successful requests: 164
Failed requests: 0
Maximum request concurrency: 1
Benchmark duration (s): 395.08
Total input tokens: 24600
Total generated tokens: 165556
Request throughput (req/s): 0.42
Output token throughput (tok/s): 419.05
Peak output token throughput (tok/s): 57.00
Peak concurrent requests: 3.00
Total token throughput (tok/s): 481.31
---------------Time to First Token----------------
Mean TTFT (ms): 91.81
Median TTFT (ms): 66.50
P99 TTFT (ms): 127.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 2.23
Median TPOT (ms): 2.10
P99 TPOT (ms): 3.75
---------------Inter-token Latency----------------
Mean ITL (ms): 18.16
Median ITL (ms): 17.93
P99 ITL (ms): 20.07
---------------Speculative Decoding---------------
Acceptance rate (%): 46.52
Acceptance length: 7.98
Drafts: 20754
Draft tokens: 311310
Accepted tokens: 144822
Per-position acceptance (%):
Position 0: 92.42
Position 1: 82.01
Position 2: 72.59
Position 3: 63.95
Position 4: 56.51
Position 5: 50.32
Position 6: 45.27
Position 7: 40.82
Position 8: 36.81
Position 9: 33.44
Position 10: 30.36
Position 11: 27.66
Position 12: 24.73
Position 13: 21.82
Position 14: 19.10
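To make the two reports easier to compare, here is a minimal sketch that recomputes the headline numbers from the raw counters. The definitions are assumptions inferred from the figures above rather than taken from the vLLM source: acceptance rate is accepted tokens divided by draft tokens, and acceptance length is one plus accepted tokens per draft (the extra token being the one the target model emits on each verification step). The names `RUNS` and `summarize` are hypothetical.

```python
# Counters copied from the two benchmark reports above.
RUNS = {
    "BF16": {"drafts": 20503, "draft_tokens": 307545, "accepted_tokens": 145292},
    "FP8": {"drafts": 20754, "draft_tokens": 311310, "accepted_tokens": 144822},
}


def summarize(run):
    """Derive (acceptance rate %, acceptance length, tokens per draft).

    These formulas are assumed definitions that happen to reproduce
    the reported values; they are not copied from vLLM.
    """
    rate = 100.0 * run["accepted_tokens"] / run["draft_tokens"]
    # Assumption: every verification step also yields one target-model token.
    length = 1.0 + run["accepted_tokens"] / run["drafts"]
    tokens_per_draft = run["draft_tokens"] / run["drafts"]
    return rate, length, tokens_per_draft


for name, run in RUNS.items():
    rate, length, k = summarize(run)
    print(f"{name}: k={k:.0f} rate={rate:.2f}% length={length:.2f}")
```

Under these assumptions the script prints k=15, 47.24% and 8.09 for BF16, and k=15, 46.52% and 7.98 for FP8, matching the reports. A low acceptance rate at a large k can therefore still correspond to a useful acceptance length.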
Interesting, it must be a misconfiguration on my SM120 RTX 6000 Blackwell with the vLLM cu130 nightly.
As DFlash was just merged into vLLM, there are probably some issues. I will try to run on RTX 6000 Blackwell to see if I can reproduce your problem 👀
Similarly, I'm interested in whether it's possible to use the parquant model z-lab/Qwen3.5-27B-PARO instead of either BF16 or FP8.
I run a 2x3090 setup and am wondering whether anyone in the community has tried this, or whether Ampere in general has been tested.
Tested again on vllm 18.2rc1 cu130 nightly, RTX 6000 Blackwell.
vllm/vllm-openai:cu130-nightly \
/models/Qwen3.5-27B-FP8 \
--async-scheduling \
--quantization fp8 \
--served-model-name Qwen3.5 \
--tensor-parallel-size 1 \
--dtype auto \
--kv-cache-dtype auto \
--trust-remote-code \
--gpu-memory-utilization 0.92 \
--max-num-seqs 32 \
--enable-prefix-caching \
--enable-chunked-prefill \
--max-num-batched-tokens 16384 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--speculative-config '{"method": "dflash", "model": "/models/Qwen3.5-27B-DFlash", "num_speculative_tokens": 8}' \
--max-model-len 262144
Acceptance is still averaging ~20%. I tried max-num-batched-tokens at 8192 and 16384, with and without multimodal.
Are there still PRs from z-lab pending merge to master?
I can confirm the same behaviour as unoid.
Running on H100, CUDA 13.1 vllm 0.19.1rc1.dev70+g8060bb033 (Build from source)
CUDA_VISIBLE_DEVICES=0 vllm serve /share_weight/Qwen3.5-27B-FP8 \
--served-model-name Qwen3.5-27B-FP8 --host 0.0.0.0 --port 9810 \
--tensor-parallel-size 1 --speculative-config '{"method": "dflash", "model": "/share_weight/Qwen3.5-27B-DFlash", "num_speculative_tokens": 8}' \
--max-num-batched-tokens 32768 --max-model-len 220000 \
--reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder \
--chat-template /share_weight/Qwen3.5-27B-FP8/chat_template.jinja
Acceptance averaging ~20%
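Since the `--speculative-config` value in these commands is a JSON string nested inside shell quoting, one way to avoid quoting mistakes is to build it programmatically. The following is only a sketch: the helper name `build_serve_command` is hypothetical, the paths are the ones from the command above, and the `"method": "dflash"` key layout is copied from this thread rather than checked against the vLLM documentation.

```python
import json
import shlex


def build_serve_command(target, draft, num_speculative_tokens=8, extra_args=()):
    """Return an argv list for `vllm serve` with a DFlash draft model.

    The speculative config keys mirror the commands posted in this thread.
    """
    spec = {
        "method": "dflash",
        "model": draft,
        "num_speculative_tokens": num_speculative_tokens,
    }
    # json.dumps guarantees valid JSON; shlex.join (below) handles shell quoting.
    return ["vllm", "serve", target, "--speculative-config", json.dumps(spec), *extra_args]


cmd = build_serve_command(
    "/share_weight/Qwen3.5-27B-FP8",
    "/share_weight/Qwen3.5-27B-DFlash",
    num_speculative_tokens=8,
    extra_args=["--tensor-parallel-size", "1", "--max-num-batched-tokens", "32768"],
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd)` directly, or the printed line pasted into a shell. Changing `num_speculative_tokens` between runs then becomes a one-argument change, which helps when sweeping values such as 8 versus 15.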
That's interesting. I tested on B200 and FP8 seems to work well. Let me test on H100.
On DGX Spark (different Blackwell with similar shader model to RTX Pro 6000; Spark is 12.1a) I am able to get this working in vLLM 0.19 but similarly see relatively low acceptance at n=15. My work has some specialty vocab so I'm seeing under 20%, usually 12-18% acceptance, with near-zero beyond position 8. Some of this is probably due to the documents and complexity. My guess is this is a similar root cause.
It's still a decent throughput gain. Realizing the full potential would be incredible! I'd gladly test anything you like.
Edit: In case it might help, I am using the official Qwen FP8 quant with the --speculative-config suggested, flash_attn, and --max-num-batched-tokens 32768. I also have prefix caching enabled as well as reasoning and tool calling.
In case it helps, these lines from the startup log seemed a little odd, as if it thought it was an EAGLE model but with strange features. It does work though:
(EngineCore pid=160) INFO 04-08 16:41:56 [eagle.py:1395] Detected EAGLE model without its own embed_tokens in the checkpoint. Sharing target model embedding weights with the draft model.
(EngineCore pid=160) INFO 04-08 16:41:56 [eagle.py:1450] Detected EAGLE model without its own lm_head in the checkpoint. Sharing target model lm_head weights with the draft model.
(EngineCore pid=160) INFO 04-08 16:41:56 [gpu_model_runner.py:4797] Using auxiliary layers from speculative config: (1, 16, 31, 46, 61)
Quick feedback: the flag
--no-enable-prefix-caching
helps to boost the acceptance rate from <20% to 30-35%.
@matichon Thanks for the information! That’s interesting. I would have thought prefix caching shouldn’t directly affect the acceptance rate. I need to take a closer look at this.
I think there were some bugs in vLLM. The bugs may not have been with DFlash but rather quite possibly the model outputs or maybe Flash Attention 2. Regardless, I rebuilt yesterday and tested with FP8 and the int4-AutoRound quants on DGX Spark.
Where I was seeing really poor acceptance beyond position 2, now in benchmarks (especially for coding tasks) I see throughput of up to ~70 tok/s. That's incredible on this hardware. It isn't all that high - but even for complex analysis of scientific documents it is a boost over the built-in MTP.
Initially I ran it with --no-enable-prefix-caching per @matichon above, but just finished testing with prefix caching enabled again, and the acceptance rates and throughput are stable. Again, it feels like a bug has been patched.
I use 4x3090 with the FP16 27B model and get a low acceptance rate.
docker rm -f $(docker ps -aq)
docker run -d \
--gpus all \
--memory 32g \
--memory-swap 64g \
--shm-size 32g \
-p 8000:8000 \
-v /home/cheng/model/Qwen3.5-27B:/model \
-v /home/cheng/model/Qwen3.5-27B-DFlash:/draft-model \
-v /home/cheng/vllm_cache:/root/.cache/vllm \
--ipc=host \
--name vllm \
--env VLLM_USE_FLASHINFER_SAMPLER=1 \
--env OMP_NUM_THREADS=2 \
--env VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 \
--env PYTORCH_ALLOC_CONF=expandable_segments:True \
--env HF_HUB_OFFLINE=1 \
--env VLLM_ENGINE_ITERATION_TIMEOUT_S=1800 \
--env VLLM_ENGINE_READY_TIMEOUT_S=1800 \
--env VLLM_RPC_TIMEOUT=1800000 \
--env VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=1800 \
--env VLLM_LOG_STATS_INTERVAL=1.0 \
--env LD_LIBRARY_PATH='/usr/local/nvidia/lib64:/usr/local/nvidia/lib:/usr/lib/x86_64-linux-gnu' \
vllm/vllm-openai:nightly \
/model \
--served-model-name Qwen3.5-27B \
--mm-encoder-attn-backend TORCH_SDPA \
--dtype auto \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--gpu-memory-utilization 0.90 \
--disable-custom-all-reduce \
--max-model-len 131072 \
--max-num-seqs 10 \
--tensor-parallel-size 4 \
--limit-mm-per-prompt '{"image": 30, "video": 0}' \
--async-scheduling \
--default-chat-template-kwargs '{"enable_thinking": false}' \
--generation-config vllm \
--speculative-config '{"method": "dflash", "model": "/draft-model", "num_speculative_tokens": 5}' \
--host 0.0.0.0
Quick summary
On the GSM8K benchmark with a concurrency of 1, I achieved an average acceptance length of 7–8 tokens at a 40% acceptance rate.
However, using my own random chat inputs in the first turn, the acceptance length dropped to approximately 3 tokens with a 20% acceptance rate.
In comparison, using the native Multi-Token Prediction (MTP) with num_speculative_tokens set to 5,
my own inputs achieved about 3.5 tokens at a 60% acceptance rate.
I suspect this discrepancy stems from the robustness of the training data distribution.
Hopefully, the poor performance on the custom input is simply due to using an incorrect checkpoint for the DFlash model.
| num-seqs | concurrent | tok/s |
|---|---|---|
| 1 | 1 | 200 ++ |
| 1 | 2 | 200 ++ |
| 16 | 1 | 200 ++ |
| 16 | 2 | 100 ++ |
Dependency
H100 80GB SXM
cuda 12.8
vllm 0.19.1rc1.dev297+g799973af4
16 num-seqs
Concurrent 1
python -m dflash.benchmark --backend vllm --base-url http://localhost:9810 --model Qwen3.5-27B-FP8 --dataset gsm8k --num-prompts 128 --concurrency 1 --enable-thinking
(APIServer pid=8529) INFO 04-15 07:47:30 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 7.42, Accepted throughput: 230.47 tokens/s, Drafted throughput: 574.33 tokens/s, Accepted: 2305 tokens, Drafted: 5744 tokens, Per-position acceptance rate: 0.933, 0.805, 0.716, 0.621, 0.507, 0.443, 0.412, 0.370, 0.334, 0.292, 0.240, 0.209, 0.175, 0.139, 0.120, 0.103, Avg Draft acceptance rate: 40.1%
(APIServer pid=8529) INFO: 127.0.0.1:56872 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=8529) INFO 04-15 07:47:40 [loggers.py:271] Engine 000: Avg prompt throughput: 13.0 tokens/s, Avg generation throughput: 247.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.3%, Prefix cache hit rate: 0.0%
(APIServer pid=8529) INFO 04-15 07:47:40 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 6.75, Accepted throughput: 210.98 tokens/s, Drafted throughput: 587.14 tokens/s, Accepted: 2110 tokens, Drafted: 5872 tokens, Per-position acceptance rate: 0.883, 0.744, 0.605, 0.534, 0.452, 0.403, 0.349, 0.324, 0.270, 0.237, 0.213, 0.188, 0.161, 0.147, 0.131, 0.109, Avg Draft acceptance rate: 35.9%
Concurrent 2
python -m dflash.benchmark --backend vllm --base-url http://localhost:9810 --model Qwen3.5-27B-FP8 --dataset gsm8k --num-prompts 128 --concurrency 2 --enable-thinking
(APIServer pid=9685) INFO 04-15 07:50:34 [loggers.py:271] Engine 000: Avg prompt throughput: 35.4 tokens/s, Avg generation throughput: 189.6 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 18.1%, Prefix cache hit rate: 0.0%
(APIServer pid=9685) INFO 04-15 07:50:34 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.73, Accepted throughput: 120.08 tokens/s, Drafted throughput: 1113.44 tokens/s, Accepted: 1201 tokens, Drafted: 11136 tokens, Per-position acceptance rate: 0.227, 0.201, 0.172, 0.151, 0.136, 0.128, 0.116, 0.101, 0.089, 0.080, 0.072, 0.066, 0.059, 0.050, 0.040, 0.036, Avg Draft acceptance rate: 10.8%
(APIServer pid=9685) INFO: 127.0.0.1:49702 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=9685) INFO: 127.0.0.1:49714 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=9685) INFO 04-15 07:50:44 [loggers.py:271] Engine 000: Avg prompt throughput: 19.8 tokens/s, Avg generation throughput: 187.1 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 18.1%, Prefix cache hit rate: 0.0%
(APIServer pid=9685) INFO 04-15 07:50:44 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 2.66, Accepted throughput: 116.68 tokens/s, Drafted throughput: 1123.04 tokens/s, Accepted: 1167 tokens, Drafted: 11232 tokens, Per-position acceptance rate: 0.362, 0.283, 0.219, 0.171, 0.141, 0.117, 0.081, 0.067, 0.054, 0.047, 0.036, 0.027, 0.023, 0.017, 0.010, 0.007, Avg Draft acceptance rate: 10.4%
1 num-seqs
Concurrent 1
python -m dflash.benchmark --backend vllm --base-url http://localhost:9810 --model Qwen3.5-27B-FP8 --dataset gsm8k --num-prompts 128 --concurrency 1 --enable-thinking
(APIServer pid=10840) INFO 04-15 07:54:48 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 5.84, Accepted throughput: 177.49 tokens/s, Drafted throughput: 587.16 tokens/s, Accepted: 1775 tokens, Drafted: 5872 tokens, Per-position acceptance rate: 0.858, 0.711, 0.553, 0.482, 0.406, 0.346, 0.283, 0.243, 0.193, 0.150, 0.131, 0.114, 0.104, 0.095, 0.084, 0.082, Avg Draft acceptance rate: 30.2%
(APIServer pid=10840) INFO: 127.0.0.1:43544 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=10840) INFO 04-15 07:54:58 [loggers.py:271] Engine 000: Avg prompt throughput: 7.8 tokens/s, Avg generation throughput: 255.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.6%, Prefix cache hit rate: 0.0%
(APIServer pid=10840) INFO 04-15 07:54:58 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 6.96, Accepted throughput: 219.29 tokens/s, Drafted throughput: 588.77 tokens/s, Accepted: 2193 tokens, Drafted: 5888 tokens, Per-position acceptance rate: 0.905, 0.766, 0.668, 0.568, 0.476, 0.421, 0.370, 0.334, 0.277, 0.231, 0.201, 0.185, 0.160, 0.152, 0.133, 0.111, Avg Draft acceptance rate: 37.2%
Concurrent 2
python -m dflash.benchmark --backend vllm --base-url http://localhost:9810 --model Qwen3.5-27B-FP8 --dataset gsm8k --num-prompts 128 --concurrency 2 --enable-thinking
(APIServer pid=10840) INFO 04-15 07:52:38 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 6.63, Accepted throughput: 202.57 tokens/s, Drafted throughput: 575.90 tokens/s, Accepted: 2026 tokens, Drafted: 5760 tokens, Per-position acceptance rate: 0.903, 0.775, 0.636, 0.533, 0.489, 0.425, 0.358, 0.286, 0.267, 0.228, 0.194, 0.156, 0.131, 0.103, 0.081, 0.064, Avg Draft acceptance rate: 35.2%
(APIServer pid=10840) INFO: 127.0.0.1:45262 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=10840) INFO: 127.0.0.1:45278 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=10840) INFO 04-15 07:52:48 [loggers.py:271] Engine 000: Avg prompt throughput: 19.8 tokens/s, Avg generation throughput: 254.4 tokens/s, Running: 1 reqs, Waiting: 1 reqs, GPU KV cache usage: 9.3%, Prefix cache hit rate: 0.0%
(APIServer pid=10840) INFO 04-15 07:52:48 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 7.13, Accepted throughput: 218.77 tokens/s, Drafted throughput: 571.12 tokens/s, Accepted: 2188 tokens, Drafted: 5712 tokens, Per-position acceptance rate: 0.899, 0.796, 0.686, 0.594, 0.515, 0.457, 0.403, 0.356, 0.289, 0.244, 0.210, 0.188, 0.154, 0.129, 0.112, 0.098, Avg Draft acceptance rate: 38.3%
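For comparing runs like the ones above, it can help to extract the numbers from the `SpecDecoding metrics` lines instead of reading them by eye. This is a rough sketch that assumes the log format stays exactly as pasted in this thread; the regular expression and the function name `parse_spec_metrics` are hypothetical and not part of vLLM.

```python
import re

# Pattern written against the log lines pasted in this thread (an assumption;
# the format may differ in other vLLM versions).
SPEC_RE = re.compile(
    r"Mean acceptance length: (?P<length>[\d.]+), "
    r".*?Accepted: (?P<accepted>\d+) tokens, "
    r"Drafted: (?P<drafted>\d+) tokens, "
    r"Per-position acceptance rate: (?P<per_pos>[\d., ]+?), "
    r"Avg Draft acceptance rate: (?P<rate>[\d.]+)%"
)


def parse_spec_metrics(line):
    """Return a dict of metrics from one log line, or None if it does not match."""
    m = SPEC_RE.search(line)
    if m is None:
        return None
    return {
        "length": float(m["length"]),
        "accepted": int(m["accepted"]),
        "drafted": int(m["drafted"]),
        "per_position": [float(x) for x in m["per_pos"].split(", ")],
        "rate": float(m["rate"]),
    }


sample = (
    "(APIServer pid=10840) INFO 04-15 07:52:38 [metrics.py:101] SpecDecoding metrics: "
    "Mean acceptance length: 6.63, Accepted throughput: 202.57 tokens/s, "
    "Drafted throughput: 575.90 tokens/s, Accepted: 2026 tokens, Drafted: 5760 tokens, "
    "Per-position acceptance rate: 0.903, 0.775, 0.636, 0.533, 0.489, 0.425, 0.358, "
    "0.286, 0.267, 0.228, 0.194, 0.156, 0.131, 0.103, 0.081, 0.064, "
    "Avg Draft acceptance rate: 35.2%"
)
metrics = parse_spec_metrics(sample)
print(metrics["length"], metrics["rate"], len(metrics["per_position"]))
```

In the sample line, one plus the sum of the per-position rates comes to about 6.63, and accepted divided by drafted comes to about 35.2%, so the parsed fields are consistent with the reported mean acceptance length and average draft acceptance rate.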
After updating to vLLM 0.21+, it seems to work pretty well on FP8 (on versions before 0.19 it seemed to crash).
What was your average draft acceptance rate?
vLLM 0.21.0
Image Tag : vllm/vllm-openai:v0.21.0-cu129-ubuntu2404@sha256:cba2cabc5ca33baf0bc4776ed2896fe4c8d8b7be7fbbeca88bc63217d07ad320
(APIServer pid=2568) INFO: 172.18.0.19:52452 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=2568) INFO 05-21 07:47:53 [loggers.py:271] Engine 000: Avg prompt throughput: 1600.4 tokens/s, Avg generation throughput: 30.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 25.4%
(APIServer pid=2568) INFO 05-21 07:47:53 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.87, Accepted throughput: 14.30 tokens/s, Drafted throughput: 262.38 tokens/s, Accepted: 143 tokens, Drafted: 2624 tokens, Per-position acceptance rate: 0.427, 0.232, 0.122, 0.067, 0.018, 0.006, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, 0.000, Avg Draft acceptance rate: 5.4%
(APIServer pid=2568) INFO 05-21 07:48:03 [loggers.py:271] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 25.4%
