Instructions to use bullpoint/GLM-4.6-AWQ with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use bullpoint/GLM-4.6-AWQ with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="bullpoint/GLM-4.6-AWQ")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    pipe(messages)

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("bullpoint/GLM-4.6-AWQ")
    model = AutoModelForCausalLM.from_pretrained("bullpoint/GLM-4.6-AWQ")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use bullpoint/GLM-4.6-AWQ with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "bullpoint/GLM-4.6-AWQ"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "bullpoint/GLM-4.6-AWQ",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
Use Docker
    docker model run hf.co/bullpoint/GLM-4.6-AWQ
- SGLang
How to use bullpoint/GLM-4.6-AWQ with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
        --model-path "bullpoint/GLM-4.6-AWQ" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "bullpoint/GLM-4.6-AWQ",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
Use Docker images
    docker run --gpus all \
        --shm-size 32g \
        -p 30000:30000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=<secret>" \
        --ipc=host \
        lmsysorg/sglang:latest \
        python3 -m sglang.launch_server \
            --model-path "bullpoint/GLM-4.6-AWQ" \
            --host 0.0.0.0 \
            --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "bullpoint/GLM-4.6-AWQ",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'
- Docker Model Runner
How to use bullpoint/GLM-4.6-AWQ with Docker Model Runner:
    docker model run hf.co/bullpoint/GLM-4.6-AWQ
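The curl calls above can also be made from Python. The following is a minimal sketch using only the standard library, assuming one of the servers shown above is already listening on localhost. `build_chat_request` and `ask` are hypothetical helper names for illustration, not part of vLLM or SGLang.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, model, prompt, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers put the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the vLLM server from above, a call would look like `ask("http://localhost:8000", "bullpoint/GLM-4.6-AWQ", "What is the capital of France?")`; for the SGLang examples, the port would be 30000 instead.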
GLM-4.6-FP8 - 55 tokens/sec on 4x RTX 6000 PRO
Hello,
I'm getting 55 tokens/sec with SGLang, using Triton for the FP8 version:
    docker run -it --rm \
        -v /mnt:/mnt/ \
        --ipc=host \
        --shm-size=8g \
        --ulimit memlock=-1 \
        --ulimit stack=67108864 \
        --gpus all \
        --network host \
        lmsysorg/sglang:b200-cu129 bash

    NCCL_P2P_LEVEL=4 \
    NCCL_DEBUG=INFO \
    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    USE_TRITON_W8A8_FP8_KERNEL=1 \
    SGL_ENABLE_JIT_DEEPGEMM=0 \
    python -m sglang.launch_server \
        --model /mnt/GLM-4.6-FP8/ \
        --tp 4 \
        --host 0.0.0.0 \
        --port 4999 \
        --mem-fraction-static 0.96 \
        --context-length 200000 \
        --enable-metrics \
        --attention-backend flashinfer \
        --tool-call-parser glm45 \
        --reasoning-parser glm45 \
        --served-model-name glm-4.5-air \
        --chunked-prefill-size 8092 \
        --enable-mixed-chunk \
        --cuda-graph-max-bs 16 \
        --kv-cache-dtype fp8_e5m2 \
        --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
Look for the missing .json file and copy voipmonitor.org/sm120.json to its path (the name is typically something like E=128,N=704,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition,dtype=fp8_w8a8.json).
Question: I suppose NVFP4 is still not implemented anywhere?
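The copy step can be scripted. This is a rough sketch that assumes you have already downloaded sm120.json and know which directory the server expects the file in; that directory is not stated in the post, so `config_dir` is a placeholder you have to fill in yourself. `config_filename` and `install_config` are hypothetical helper names, and the `e` and `n` parameters simply mirror the `E=` and `N=` fields of the example file name.

```python
import shutil
from pathlib import Path


def config_filename(e, n, device_name, dtype):
    """Build a file name following the pattern quoted in the post."""
    return f"E={e},N={n},device_name={device_name},dtype={dtype}.json"


def install_config(downloaded_json, config_dir, e, n, device_name, dtype):
    """Copy the downloaded JSON to the name the server is looking for."""
    target = Path(config_dir) / config_filename(e, n, device_name, dtype)
    if target.exists():
        # Leave an existing config alone rather than overwrite it.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(downloaded_json, target)
    return target
```

For the example in the post, the call would be `install_config("sm120.json", config_dir, 128, 704, "NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition", "fp8_w8a8")`, with the values adjusted to match whatever file name is actually reported missing on your machine.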
I've not seen an NVFP4 quant yet. 55 t/s is pretty great for FP8; that's about what I get with this AWQ. My FP8 on vLLM is about 30 t/s. Have you tried using EAGLE for MTP with SGLang? I'm currently distilling 4.6 for coding to see if I can train some Medusa heads for MTP. Unfortunately, it is going to take a while.
vLLM's FP8 path uses CUTLASS, which is not as fast as the Triton FP8 implementation on sm120. I have enabled the Triton path for FP8, but you have to set USE_TRITON_W8A8_FP8_KERNEL=1.
Here is an NVFP4 quant: https://huggingface.co/RESMP-DEV/GLM-4.6-NVFP4
The problem is that I can't find any inference engine that supports NVFP4 block scale on sm120.
I believe we should get double the speed once NVFP4 runs natively: if we get 55 tokens/sec with FP8, we should get about 110 with NVFP4.
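The 110 tokens/sec guess follows from a simple proportionality: if decoding speed is limited by how fast the weights can be read, then halving the bits per weight halves the bytes read per token. Here is a back-of-the-envelope sketch under that assumption. It ignores block-scale overhead, KV cache traffic, and kernel efficiency, so it is an optimistic estimate and not a measurement. `estimate_tokens_per_sec` is a hypothetical helper.

```python
def estimate_tokens_per_sec(measured_tps, measured_bits, target_bits):
    """Scale a measured decode speed by the ratio of bits per weight.

    Assumes throughput is inversely proportional to weight size,
    i.e. decoding is purely bound by reading the weights.
    """
    return measured_tps * measured_bits / target_bits


# 55 tokens/sec measured with 8-bit weights, projected to 4-bit weights.
print(estimate_tokens_per_sec(55, 8, 4))  # 110.0
```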
EAGLE MTP for GLM-4.6 is memory bound and is slower than running without it. It's different with GLM-4.5-Air-FP8, though: EAGLE works there, and I'm getting 180 tokens/sec on 4 cards.
    USE_TRITON_W8A8_FP8_KERNEL=1 \
    SGL_ENABLE_JIT_DEEPGEMM=false \
    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    python -m sglang.launch_server \
        --model /mnt/GLM-4.5-Air-FP8/ \
        --tp 4 \
        --speculative-algorithm EAGLE \
        --speculative-num-steps 3 \
        --speculative-eagle-topk 1 \
        --speculative-num-draft-tokens 4 \
        --host 0.0.0.0 \
        --port 5000 \
        --mem-fraction-static 0.80 \
        --context-length 128000 \
        --enable-metrics \
        --attention-backend flashinfer \
        --tool-call-parser glm45 \
        --reasoning-parser glm45 \
        --served-model-name glm-4.5-air \
        --chunked-prefill-size 64736 \
        --enable-mixed-chunk \
        --cuda-graph-max-bs 1024 \
        --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
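Why speculative decoding helps one model and hurts another can be reasoned about with a simple model. This sketch assumes a single draft chain (as with top-k 1 above) and that each draft token is accepted independently with the same probability, which is a simplification. The acceptance rate and the relative cost of a draft-plus-verify pass are inputs you would have to measure; nothing in the thread reports them. Both function names are hypothetical.

```python
def expected_tokens_per_pass(acceptance_rate, num_steps):
    """Expected tokens produced by one draft-plus-verify pass.

    With num_steps draft tokens in a chain, the pass yields the accepted
    prefix plus one token from the target model: 1 + a + a^2 + ... + a^k.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    return sum(acceptance_rate ** i for i in range(num_steps + 1))


def estimated_speedup(acceptance_rate, num_steps, relative_pass_cost):
    """Speedup over plain decoding.

    relative_pass_cost is the time of one draft-plus-verify pass divided
    by the time of one ordinary decode step.
    """
    return expected_tokens_per_pass(acceptance_rate, num_steps) / relative_pass_cost
```

With 3 draft steps, a pass can yield at most 4 tokens. If, hypothetically, half the draft tokens were accepted and a pass cost twice as much as a plain step, `estimated_speedup(0.5, 3, 2.0)` comes out below 1, meaning speculation would be a net slowdown. That is the kind of outcome described for GLM-4.6 above, though the real numbers may differ.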
festr2, does this AWQ quant run on your RTX 6000s? I'm getting some errors. The other AWQ by QuantTrio loads, but it is not outputting reasoning or stopping generation properly in SGLang. Maybe I'll have better luck with these AWQs in vLLM?
Update: works well in vLLM! It handles parallel requests and parallel tool calls great! Thanks!
you mean FP8?
FP8 I ran in SGLang, and it works great! I meant that this AWQ doesn't work for me in SGLang, but it does load and work great in vLLM.