Instructions for using bullpoint/GLM-4.6-AWQ with libraries and local apps.

Libraries

How to use bullpoint/GLM-4.6-AWQ with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="bullpoint/GLM-4.6-AWQ")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bullpoint/GLM-4.6-AWQ")
model = AutoModelForCausalLM.from_pretrained("bullpoint/GLM-4.6-AWQ")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use bullpoint/GLM-4.6-AWQ with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "bullpoint/GLM-4.6-AWQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "bullpoint/GLM-4.6-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
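
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are illustrative and not part of vLLM, and it assumes the server started above is reachable at `http://localhost:8000`.

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> dict:
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(base_url: str, model: str, prompt: str, timeout: float = 120.0) -> str:
    """POST one chat request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return payload["choices"][0]["message"]["content"]


# Example call (requires a running server, so it is left commented out):
# print(chat("http://localhost:8000", "bullpoint/GLM-4.6-AWQ",
#            "What is the capital of France?"))
```

Pointing `base_url` at `http://localhost:30000` should work the same way against an SGLang server, since both expose the same chat completions route.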

Use Docker

docker model run hf.co/bullpoint/GLM-4.6-AWQ

SGLang

How to use bullpoint/GLM-4.6-AWQ with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "bullpoint/GLM-4.6-AWQ" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "bullpoint/GLM-4.6-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "bullpoint/GLM-4.6-AWQ" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "bullpoint/GLM-4.6-AWQ",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use bullpoint/GLM-4.6-AWQ with Docker Model Runner:
```
docker model run hf.co/bullpoint/GLM-4.6-AWQ
```

GLM-4.6-FP8 - 55 tokens/sec on 4x RTX 6000 PRO

by festr2 - opened Oct 11, 2025

Discussion

festr2

Oct 11, 2025

Hello,

I'm getting 55 tokens/sec with SGLang, using Triton for the FP8 version:

docker run -it --rm \
    -v /mnt:/mnt/ \
    --ipc=host \
    --shm-size=8g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --gpus all \
    --network host \
    lmsysorg/sglang:b200-cu129 bash

NCCL_P2P_LEVEL=4 \
NCCL_DEBUG=INFO \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
USE_TRITON_W8A8_FP8_KERNEL=1 \
SGL_ENABLE_JIT_DEEPGEMM=0 \
python -m sglang.launch_server \
    --model /mnt/GLM-4.6-FP8/ \
    --tp 4 \
    --host 0.0.0.0 \
    --port 4999 \
    --mem-fraction-static 0.96 \
    --context-length 200000 \
    --enable-metrics \
    --attention-backend flashinfer \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --served-model-name glm-4.5-air \
    --chunked-prefill-size 8092 \
    --enable-mixed-chunk \
    --cuda-graph-max-bs 16 \
    --kv-cache-dtype fp8_e5m2 \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'

Look for the missing .json file and copy voipmonitor.org/sm120.json to that path (the file name is typically something like E=128,N=704,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition,dtype=fp8_w8a8.json).
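
To make the file name pattern concrete, here is a small sketch that assembles a name of that shape and copies a downloaded config into place. Both helper names are hypothetical, and the rule that spaces in the GPU name become underscores is an assumption inferred from the example name; the directory the engine actually looks in is not stated in this thread, so it is left as a parameter.

```python
import shutil
from pathlib import Path


def moe_config_file_name(num_experts: int, n: int, device_name: str, dtype: str) -> str:
    """Assemble a file name in the E=...,N=...,device_name=...,dtype=... pattern."""
    # Assumption: spaces in the GPU name are replaced with underscores.
    safe_device = device_name.replace(" ", "_")
    return f"E={num_experts},N={n},device_name={safe_device},dtype={dtype}.json"


def install_config(downloaded: Path, config_dir: Path, file_name: str) -> Path:
    """Copy an already downloaded config (e.g. sm120.json) to the expected name."""
    target = config_dir / file_name
    shutil.copyfile(downloaded, target)
    return target


name = moe_config_file_name(
    128, 704, "NVIDIA RTX PRO 6000 Blackwell Server Edition", "fp8_w8a8"
)
print(name)
```

The values 128 and 704 come from the example name above; other models or tensor-parallel sizes would presumably produce different values, so the name reported as missing is the one to trust.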

Question: I suppose NVFP4 is still not implemented anywhere?

bullpoint

Owner Oct 11, 2025

I've not seen an NVFP4 quant yet. 55 t/s is pretty great for FP8; that's about what I get with this AWQ. My FP8 on vLLM is about 30 t/s. Have you tried using EAGLE for MTP with SGLang? I'm currently distilling 4.6 for coding to see if I can train some Medusa heads for MTP. Unfortunately, it is going to take a while.

festr2

Oct 11, 2025

vLLM FP8 uses CUTLASS, which is not as fast as the Triton FP8 implementation on sm120. I have enabled the Triton path for FP8, but you have to set USE_TRITON_W8A8_FP8_KERNEL=1.

Here is an NVFP4 quant: https://huggingface.co/RESMP-DEV/GLM-4.6-NVFP4
The problem is that I can't find any inference engine that supports NVFP4 block scale on sm120.
I believe we should get double the speed once NVFP4 is used natively: if we get 55 tokens/sec with FP8, we should get about 110 with NVFP4.
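
The doubling estimate follows from treating decoding as purely memory-bandwidth bound, so that throughput scales inversely with the bits read per weight. A minimal sketch of that arithmetic is below; the function name is illustrative, and the model deliberately ignores scale-factor storage, KV-cache reads, and kernel efficiency, so it is an upper-bound style estimate rather than a prediction.

```python
def memory_bound_tokens_per_sec(
    baseline_tps: float, baseline_bits: int, target_bits: int
) -> float:
    """Scale decode throughput by the ratio of weight bits.

    Assumes decoding speed is limited only by how fast the weights can be
    read from GPU memory, which is a simplification.
    """
    return baseline_tps * baseline_bits / target_bits


fp8_tps = 55.0  # measured FP8 figure quoted in this thread
nvfp4_estimate = memory_bound_tokens_per_sec(fp8_tps, baseline_bits=8, target_bits=4)
print(nvfp4_estimate)  # 110.0
```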

EAGLE MTP for GLM-4.6 is memory bound and is slower than not using it. But that's different with GLM-4.5-Air-FP8: EAGLE works there, and I'm getting 180 tokens/sec on 4 cards.

USE_TRITON_W8A8_FP8_KERNEL=1 \
SGL_ENABLE_JIT_DEEPGEMM=false \
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python -m sglang.launch_server \
    --model /mnt/GLM-4.5-Air-FP8/ \
    --tp 4 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --host 0.0.0.0 \
    --port 5000 \
    --mem-fraction-static 0.80 \
    --context-length 128000 \
    --enable-metrics \
    --attention-backend flashinfer \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --served-model-name glm-4.5-air \
    --chunked-prefill-size 64736 \
    --enable-mixed-chunk \
    --cuda-graph-max-bs 1024 \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
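
Why speculative decoding can help one model and hurt another can be illustrated with a generic back-of-the-envelope model. This is only a sketch under stated assumptions, not how SGLang computes anything: it assumes a single draft chain (top-k of 1), that each drafted token is accepted independently with the same probability, and that `step_cost_ratio` (a hypothetical parameter) captures how much more expensive one draft-plus-verify pass is than one plain decode step. Mapping `draft_steps=3` to `--speculative-num-steps 3` is likewise an assumption, and the acceptance rates used are made up for illustration.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_steps: int) -> float:
    """Expected tokens emitted per verification pass for a single draft chain.

    Assumes each drafted token is accepted independently with probability
    `acceptance_rate`; the verifier always contributes one token itself.
    """
    a = acceptance_rate
    if a >= 1.0:
        return float(draft_steps + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_steps
    return (1.0 - a ** (draft_steps + 1)) / (1.0 - a)


def estimated_speedup(
    acceptance_rate: float, draft_steps: int, step_cost_ratio: float
) -> float:
    """Tokens gained per pass divided by the relative cost of that pass."""
    return expected_tokens_per_step(acceptance_rate, draft_steps) / step_cost_ratio


# Hypothetical numbers: the same acceptance rate gives a slowdown when each
# speculative pass costs twice a plain step, and a speedup when it costs 1.2x.
print(estimated_speedup(0.5, 3, step_cost_ratio=2.0))
print(estimated_speedup(0.5, 3, step_cost_ratio=1.2))
```

Under this model a speedup below 1.0 means speculation is a net loss, which is consistent with the observation above that EAGLE is slower on the larger, memory-bound model.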

Fernanda24

Oct 16, 2025

festr2, does this AWQ quant run on your RTX 6000s? I'm getting some errors. The other AWQ by QuantTrio loads, but it is not outputting reasoning or stopping generations properly. That's all in SGLang; maybe I'll have better luck with these AWQs in vLLM?

Fernanda24

Oct 16, 2025

Update: it works well in vLLM! It handles parallel requests and parallel tool calls great. Thanks!

festr2

Oct 16, 2025

you mean FP8?

Fernanda24

Oct 17, 2025

FP8 I ran in SGLang, and it works great! I meant that this AWQ doesn't work for me in SGLang, but it does load and work great in vLLM.

bullpoint changed discussion status to closed Nov 28, 2025
