# DeepSeek-R1-Distill-Llama-8B — NVFP4 Quantization
- **Headline:** NVFP4 shifts gsm8k by -3.94 pp vs BF16 (64.52% → 60.58%, 95% CI ±3.69 pp) on a single GB10.
- Quantized from deepseek-ai/DeepSeek-R1-Distill-Llama-8B using NVIDIA TensorRT Model-Optimizer.
- Requires NVIDIA Blackwell architecture (GB10, B200, GB200, RTX 5090, RTX 6000 Pro).
- Upstream model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B @ 6a6f4aa41979
- Calibration: default (cnn_dailymail)
- Eval context length: 2048 tokens — quality numbers may understate the long-context ceiling (see BENCHMARKS.md)
- Generated: 2026-04-23 16:02:55 UTC by Spark NVFP4 Lab
- Hardware: NVIDIA GB10 (DGX Spark), aarch64, CUDA 13.0
## Why this release exists
Most NVFP4 quantizations on HuggingFace ship without any measured side-by-side
comparison against the BF16 baseline. Spark NVFP4 Lab evaluates every release
against its unquantized parent on the same hardware, same task set, same sampling,
so the delta is real — not estimated. Full per-task results in BENCHMARKS.md;
raw lm-evaluation-harness JSON is shipped in the repo.
## Footprint
- On-disk footprint: 14.97 GiB (BF16) → 5.63 GiB (NVFP4) — 2.66× reduction
- Observed peak VRAM during eval (max_seq_len=2048): BF16 ~28.0 GiB / NVFP4 ~14.5 GiB — includes weights + KV cache + activations
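A quick sanity check on those ratios, using only the numbers stated above:

```python
# Back-of-envelope footprint arithmetic from the figures above (GiB).
bf16_disk, nvfp4_disk = 14.97, 5.63
bf16_vram, nvfp4_vram = 28.0, 14.5

print(f"disk reduction: {bf16_disk / nvfp4_disk:.2f}x")        # → 2.66x
print(f"peak-VRAM reduction: {bf16_vram / nvfp4_vram:.2f}x")   # → 1.93x
```

The VRAM ratio is below the on-disk ratio because the KV cache and activations stay in higher precision regardless of weight format.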
## Recommended use
- Best for reasoning / math workloads on Blackwell hardware where VRAM or memory bandwidth is the constraint.
- Expect higher aggregate throughput at concurrency ≥ 4 than BF16 due to smaller weight footprint and native FP4 tensor-core paths.
- Not recommended for contexts > 16k tokens without re-validation — this release was evaluated at 2k.
- Always compare against BF16 for your own task before committing to a quantization.
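When you run that comparison, account for statistical noise. A sketch of a 95% normal-approximation interval on the accuracy difference between two independent eval runs over GSM8K's 1319-item test split reproduces the headline ±3.69 pp (an assumption: the card does not state how its CI was computed):

```python
import math

def delta_ci_halfwidth_pp(acc_a, acc_b, n, z=1.96):
    """95% CI half-width, in percentage points, on the difference of two
    independent proportions each measured over n items."""
    var = acc_a * (1 - acc_a) / n + acc_b * (1 - acc_b) / n
    return z * math.sqrt(var) * 100

hw = delta_ci_halfwidth_pp(0.6452, 0.6058, 1319)   # BF16 vs NVFP4 gsm8k
print(f"delta = -3.94 pp, 95% CI ±{hw:.2f} pp")     # → ±3.69 pp
```

If your own task's delta falls inside an interval like this, the quantization loss may not be distinguishable from run-to-run noise.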
## Known-good usage — trtllm-serve
```shell
docker run --rm --gpus all --ipc=host --network host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$(pwd)/weights:/workspace/model" \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  trtllm-serve /workspace/model \
    --backend pytorch \
    --max_batch_size 4 \
    --port 8000

# Then hit the OpenAI-compatible endpoint:
curl -s http://127.0.0.1:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"model","prompt":"Solve: 37*41 =","max_tokens":128,"temperature":0}'
```
A chat template is included at `chat_template.jinja`; most
inference servers pick it up automatically from `tokenizer_config.json`.
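The same endpoint can also be called from Python with only the standard library. A minimal sketch, assuming the server from the docker command above is running on localhost (the endpoint URL and the `"model"` name follow that example):

```python
import json
import urllib.request

ENDPOINT = "http://127.0.0.1:8000/v1/completions"  # trtllm-serve default above

def build_request(prompt, max_tokens=128, temperature=0.0):
    # temperature=0 for deterministic decoding, matching the curl example.
    return {"model": "model", "prompt": prompt,
            "max_tokens": max_tokens, "temperature": temperature}

def complete(prompt, endpoint=ENDPOINT):
    body = json.dumps(build_request(prompt)).encode()
    req = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# Usage (with the server running):
#   print(complete("Solve: 37*41 ="))
```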
## Tested engines
- ✅ TensorRT-LLM 1.1.0rc3 — via `nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev`
- ❓ vLLM — not tested in this release. vLLM ≥ 0.6 Blackwell builds support NVFP4; verify before relying on it.
- ❓ SGLang — not tested.
## Reproducing this artifact
Every artifact ships with `REPRODUCE.sh`, a one-shot script that
recreates this exact quantization. The container digest, ModelOpt branch, upstream
model revision, and per-file sha256 are pinned in `manifest.json`.
## How to verify this release
```shell
# 1. Clone the weights
git clone https://huggingface.co/GumbiiDigital/DeepSeek-R1-Distill-Llama-8B-NVFP4
cd DeepSeek-R1-Distill-Llama-8B-NVFP4

# 2. Check per-file sha256 against the manifest
# (sha256sum -c expects two spaces between digest and filename)
jq -r '.weights_sha256 | to_entries[] | "\(.value)  \(.key)"' manifest.json | sha256sum -c -

# 3. Confirm the upstream base-model revision matches
jq -r '.base_model_sha' manifest.json
# Should match: huggingface.co/api/models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B → .sha
```
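The checksum step can also be done without `jq`. A Python sketch, assuming `manifest.json` carries the `weights_sha256` (filename → hex digest) and `base_model_sha` keys used above:

```python
import hashlib
import json
import pathlib

def file_sha256(path, chunk=1 << 20):
    # Stream the file so multi-GiB weight shards don't load into RAM at once.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify(repo_dir):
    repo = pathlib.Path(repo_dir)
    manifest = json.loads((repo / "manifest.json").read_text())
    mismatched = [name for name, want in manifest["weights_sha256"].items()
                  if file_sha256(repo / name) != want]
    return manifest["base_model_sha"], mismatched  # empty list = all files match
```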
## Known limitations
- NVFP4 weights cannot be loaded by vanilla HF transformers — you need a Blackwell inference engine (TensorRT-LLM, vLLM ≥ 0.6 Blackwell, SGLang Blackwell builds).
- Hardware floor: GB10 / B200 / GB200 / RTX 5090 / RTX 6000 Pro. Older GPUs (Hopper, Ada, Ampere) cannot execute the NVFP4 kernels.
- Quality not evaluated on MMLU, HellaSwag, ARC, TruthfulQA, or wikitext-PPL. These are log-probability tasks; the TRT-LLM OpenAI shim does not currently expose logprobs, and HF transformers cannot load NVFP4. Release-1 will unblock these via a custom backend.
- Evaluated at max_length = 2048. Long-context (>16k) behavior is not characterized in this release.
## License
MIT — inherited from the upstream parent deepseek-ai/DeepSeek-R1-Distill-Llama-8B. Read it before redistribution.
Made on a single DGX Spark. Questions or feedback? File an issue at Spark NVFP4 Lab on GitHub.