DeepSeek-R1-Distill-Llama-8B — NVFP4 Quantization

Headline: NVFP4 shifts gsm8k by -3.94 pp vs BF16 (64.52% → 60.58%, 95% CI ±3.69 pp) on a single GB10.

Quantized from deepseek-ai/DeepSeek-R1-Distill-Llama-8B using NVIDIA TensorRT Model Optimizer (ModelOpt). Requires NVIDIA Blackwell architecture (GB10, B200, GB200, RTX 5090, RTX 6000 Pro).

  • Upstream model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B @ 6a6f4aa41979
  • Calibration: default (cnn_dailymail)
  • Eval context length: 2048 tokens; quality numbers may understate the long-context ceiling (see BENCHMARKS.md)
  • Generated: 2026-04-23 16:02:55 UTC by Spark NVFP4 Lab
  • Hardware: NVIDIA GB10 (DGX Spark), aarch64, CUDA 13.0

Why this release exists

Most NVFP4 quantizations on HuggingFace ship without any measured side-by-side comparison against the BF16 baseline. Spark NVFP4 Lab evaluates every release against its unquantized parent on the same hardware, same task set, same sampling, so the delta is real — not estimated. Full per-task results in BENCHMARKS.md; raw lm-evaluation-harness JSON is shipped in the repo.
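The shipped harness JSON can be queried directly with jq. A minimal sketch using a stand-in file, since the exact key names vary by lm-evaluation-harness version ("exact_match,strict-match" here is illustrative; check the JSON in this repo for the real keys):

```shell
# Stand-in results file mirroring the lm-evaluation-harness JSON layout;
# the real file ships alongside the weights in this repo
cat > /tmp/results-example.json <<'EOF'
{ "results": { "gsm8k": { "exact_match,strict-match": 0.6058 } } }
EOF

# Pull out the headline metric
jq -r '.results.gsm8k["exact_match,strict-match"]' /tmp/results-example.json
```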

Footprint

  • On-disk footprint: 14.97 GiB (BF16) → 5.63 GiB (NVFP4) — 2.66× reduction
  • Observed peak VRAM during eval (max_seq_len=2048): BF16 ~28.0 GiB / NVFP4 ~14.5 GiB — includes weights + KV cache + activations
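The reduction factors follow directly from the figures above; a quick awk sanity check:

```shell
# Recompute the published reduction factors from the raw sizes above
disk_ratio=$(awk 'BEGIN { printf "%.2f", 14.97 / 5.63 }')   # on-disk: BF16 / NVFP4
vram_ratio=$(awk 'BEGIN { printf "%.2f", 28.0 / 14.5 }')    # peak eval VRAM
echo "disk: ${disk_ratio}x  vram: ${vram_ratio}x"
```

Note the VRAM ratio (~1.93×) is smaller than the on-disk ratio (2.66×) because KV cache and activations are not quantized to FP4.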

Recommended use

  • Best for reasoning / math workloads on Blackwell hardware where VRAM or memory bandwidth is the constraint.
  • Expect higher aggregate throughput than BF16 at concurrency ≥ 4, due to the smaller weight footprint and native FP4 tensor-core paths.
  • Not recommended for contexts > 16k tokens without re-validation — this release was evaluated at 2k.
  • Always compare against BF16 for your own task before committing to a quantization.
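One way to run that comparison is an A/B spot check against two running servers. A sketch under assumed conditions: a BF16 server on port 8000 and this NVFP4 build on port 8001 (both ports are assumptions; adjust to your deployment):

```shell
# Hypothetical A/B spot check: same greedy-decoded prompts against both servers.
# Ports 8000 (BF16) and 8001 (NVFP4) are assumptions, not part of this release.
for prompt in "Solve: 37*41 =" "What is the capital of France?"; do
  for port in 8000 8001; do
    # Build the body with jq so shell quoting cannot corrupt the JSON
    body=$(jq -n --arg p "$prompt" \
      '{model: "model", prompt: $p, max_tokens: 64, temperature: 0}')
    echo "--- port ${port}: ${prompt}"
    curl -s "http://127.0.0.1:${port}/v1/completions" \
      -H 'Content-Type: application/json' \
      -d "$body" | jq -r '.choices[0].text'
  done
done
```

temperature 0 keeps both runs deterministic so differences reflect the quantization, not sampling noise.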

Known-good usage — trtllm-serve

docker run --rm --gpus all --ipc=host --network host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -v "$(pwd)/weights:/workspace/model" \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  trtllm-serve /workspace/model \
    --backend pytorch \
    --max_batch_size 4 \
    --port 8000

# Then hit the OpenAI-compatible endpoint:
curl -s http://127.0.0.1:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"model","prompt":"Solve: 37*41 =","max_tokens":128,"temperature":0}'

A chat template is included at chat_template.jinja; most inference servers pick it up automatically from tokenizer_config.json.
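If your client speaks the chat API instead of raw completions, the same server applies the bundled template on /v1/chat/completions. A minimal sketch, assuming the trtllm-serve instance above is listening on port 8000:

```shell
# Build the request body with jq so quoting stays valid JSON
PAYLOAD=$(jq -n '{
  model: "model",
  messages: [ { role: "user", content: "Solve: 37*41 =" } ],
  max_tokens: 128,
  temperature: 0
}')

# Send it and extract just the assistant text (requires the server to be up)
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d "$PAYLOAD" | jq -r '.choices[0].message.content'
```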

Tested engines

  • TensorRT-LLM 1.1.0rc3 — via nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev
  • vLLM — not tested in this release. vLLM ≥ 0.6 Blackwell builds support NVFP4; verify before relying.
  • SGLang — not tested.

Reproducing this artifact

Every artifact ships with REPRODUCE.sh — a one-shot script that recreates this exact quantization. Container digest, ModelOpt branch, upstream model revision, and per-file sha256 are pinned in manifest.json.
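For orientation, a stand-in showing roughly what the pinned fields look like. Only weights_sha256 and base_model_sha are named elsewhere in this card; any other field names below are illustrative, and the hash values are placeholders — consult the real manifest.json in the repo:

```shell
# Stand-in manifest; schema beyond weights_sha256 / base_model_sha is illustrative
cat > /tmp/manifest-example.json <<'EOF'
{
  "base_model_sha": "6a6f4aa41979",
  "weights_sha256": {
    "model-00001-of-00002.safetensors": "<sha256 placeholder>"
  }
}
EOF

# The pinned upstream revision this quantization was built from
jq -r '.base_model_sha' /tmp/manifest-example.json
```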

How to verify this release

# 1. Clone the weights
git clone https://huggingface.co/GumbiiDigital/DeepSeek-R1-Distill-Llama-8B-NVFP4

# 2. Check per-file sha256 against manifest
cd DeepSeek-R1-Distill-Llama-8B-NVFP4
jq -r '.weights_sha256 | to_entries[] | "\(.value)  \(.key)"' manifest.json | sha256sum -c -

# 3. Confirm upstream base-model revision matches
jq -r '.base_model_sha' manifest.json
# Should match the revision reported by the Hugging Face API:
curl -s https://huggingface.co/api/models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B | jq -r '.sha'

Known limitations

  • NVFP4 weights cannot be loaded by vanilla HF transformers — you need a Blackwell inference engine (TensorRT-LLM, vLLM ≥ 0.6 Blackwell, SGLang Blackwell builds).
  • Hardware floor: GB10 / B200 / GB200 / RTX 5090 / RTX 6000 Pro. Older GPUs (Hopper, Ada, Ampere) cannot execute the NVFP4 kernels.
  • Quality not evaluated on MMLU, HellaSwag, ARC, TruthfulQA, wikitext-PPL. These are log-probability tasks; the TRT-LLM OpenAI shim does not currently expose logprobs, and HF transformers cannot load NVFP4. Release-1 will unblock via a custom backend.
  • Evaluated at max_length = 2048. Long-context (>16k) behavior is not characterized in this release.

License

MIT, inherited from the upstream parent deepseek-ai/DeepSeek-R1-Distill-Llama-8B. Read the upstream license before redistribution.


Made on a single DGX Spark. Questions or feedback? File an issue at Spark NVFP4 Lab on GitHub.
