GLM-4.7-GGUF
I am currently looking for open positions! If you find this model useful or are looking for a talented AI/LLM Engineer, please reach out to me on LinkedIn: Aaryan Kapoor.
Description
This repository contains GGUF format model files for Zhipu AI's GLM-4.7.
GLM-4.7 is a powerful open-weights model designed for complex reasoning, agentic coding, and tool use. It supports "Thinking" (Chain of Thought) natively within its chat template.
Benchmark performance. The table below compares GLM-4.7 with GLM-4.6, Kimi K2 Thinking, DeepSeek-V3.2, Gemini 3.0 Pro, Claude Sonnet 4.5, GPT-5-High, and GPT-5.1-High on 17 benchmarks (8 reasoning, 5 coding, and 4 agent benchmarks).
| Benchmark | GLM-4.7 | GLM-4.6 | Kimi K2 Thinking | DeepSeek-V3.2 | Gemini 3.0 Pro | Claude Sonnet 4.5 | GPT-5-High | GPT-5.1-High |
|---|---|---|---|---|---|---|---|---|
| MMLU-Pro | 84.3 | 83.2 | 84.6 | 85.0 | 90.1 | 88.2 | 87.5 | 87.0 |
| GPQA-Diamond | 85.7 | 81.0 | 84.5 | 82.4 | 91.9 | 83.4 | 85.7 | 88.1 |
| HLE | 24.8 | 17.2 | 23.9 | 25.1 | 37.5 | 13.7 | 26.3 | 25.7 |
| HLE (w/ Tools) | 42.8 | 30.4 | 44.9 | 40.8 | 45.8 | 32.0 | 35.2 | 42.7 |
| AIME 2025 | 95.7 | 93.9 | 94.5 | 93.1 | 95.0 | 87.0 | 94.6 | 94.0 |
| HMMT Feb. 2025 | 97.1 | 89.2 | 89.4 | 92.5 | 97.5 | 79.2 | 88.3 | 96.3 |
| HMMT Nov. 2025 | 93.5 | 87.7 | 89.2 | 90.2 | 93.3 | 81.7 | 89.2 | - |
| IMOAnswerBench | 82.0 | 73.5 | 78.6 | 78.3 | 83.3 | 65.8 | 76.0 | - |
| LiveCodeBench-v6 | 84.9 | 82.8 | 83.1 | 83.3 | 90.7 | 64.0 | 87.0 | 87.0 |
| SWE-bench Verified | 73.8 | 68.0 | 71.3 | 73.1 | 76.2 | 77.2 | 74.9 | 76.3 |
| SWE-bench Multilingual | 66.7 | 53.8 | 61.1 | 70.2 | - | 68.0 | 55.3 | - |
| Terminal Bench Hard | 33.3 | 23.6 | 30.6 | 35.4 | 39.0 | 33.3 | 30.5 | 43.0 |
| Terminal Bench 2.0 | 41.0 | 24.5 | 35.7 | 46.4 | 54.2 | 42.8 | 35.2 | 47.6 |
| BrowseComp | 52.0 | 45.1 | - | 51.4 | - | 24.1 | 54.9 | 50.8 |
| BrowseComp (w/ Context Manage) | 67.5 | 57.5 | 60.2 | 67.6 | 59.2 | - | - | - |
| BrowseComp-Zh | 66.6 | 49.5 | 62.3 | 65.0 | - | 42.4 | 63.0 | - |
| τ²-Bench | 87.4 | 75.2 | 74.3 | 85.3 | 90.7 | 87.2 | 82.4 | 82.7 |
How to Run (llama.cpp)
Important: This model uses "Thinking" (Chain of Thought), which consumes significant context. You must increase the generation limit (-n) and specify stop tokens to prevent infinite loops.
1. CLI Inference (Interactive Chat)
# -n 2048     allow enough tokens for "Thinking"
# -c 8192     adjust context size based on available VRAM
# --temp 0.7  recommended for reasoning
# -ngl 99     offload layers to the GPU (reduce if you run out of memory)
# -r          CRITICAL: reverse prompts that prevent infinite generation loops (one flag per token)
# -cnv        enable conversation mode
./llama-cli -m GLM-4.7.Q4_K_M.gguf \
  -n 2048 \
  -c 8192 \
  --temp 0.7 \
  --top-p 0.9 \
  -ngl 99 \
  -r "<|user|>" \
  -r "<|observation|>" \
  -cnv \
  -p "Hello"
Note: If you want to see the internal "Thinking" process (the text between `<think>` tags), add the `--special` flag to the command.
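When the thinking text is visible, it usually helps to separate it from the final answer before showing the result to a user. The sketch below is one way to do that in Python. It assumes the reasoning ends with a closing `</think>` tag (the text above only mentions `<think>` tags, so treat the closing tag as an assumption) and that there is at most one thinking block per response. The function name `split_thinking` is hypothetical.

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split raw model output into (thinking, answer).

    Assumes at most one thinking block, terminated by "</think>".
    The opening "<think>" tag may be missing when the prompt itself
    already ended with it, so it is stripped only if present.
    """
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No closing tag found: treat the whole output as the answer.
        return "", text.strip()
    thinking = head.replace("<think>", "", 1).strip()
    return thinking, tail.strip()


raw = "<think>The user greets me.</think>Hello! How can I help?"
thinking, answer = split_thinking(raw)
print("thinking:", thinking)
print("answer:", answer)
```

If the output contains no closing tag, the function returns an empty thinking string and the untouched answer, so it is safe to call on responses produced without `--special`.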
2. Server Mode (API)
Running a persistent server is recommended for a model of this size, since it avoids reloading the weights on every run.
./llama-server -m GLM-4.7.Q4_K_M.gguf \
--port 8080 \
-ngl 99 \
-c 8192 \
-n 2048 \
--alias glm4
API Request Example (JSON):
When using the API, ensure you include the stop tokens in your payload:
{
"model": "glm4",
"messages": [
{ "role": "user", "content": "Explain quantum computing." }
],
"stop": ["<|user|>", "<|observation|>"],
"max_tokens": 2048
}
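The same request can be sent from Python using only the standard library. This is a minimal sketch, not a reference client: it assumes the server from the previous step is listening on `http://localhost:8080`, that it exposes the OpenAI-compatible `/v1/chat/completions` route, and that the response follows the usual `choices[0].message.content` shape. The helper names `build_payload` and `chat` are hypothetical.

```python
import json
import urllib.request

# Stop tokens from the payload above; omitting them risks runaway generation.
STOP_TOKENS = ["<|user|>", "<|observation|>"]


def build_payload(prompt: str, model: str = "glm4", max_tokens: int = 2048) -> dict:
    """Build the JSON body for a single-turn chat request."""
    return {
        "model": model,  # matches the --alias passed to llama-server
        "messages": [{"role": "user", "content": prompt}],
        "stop": STOP_TOKENS,
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = "http://localhost:8080") -> str:
    """POST the payload to the local server and return the reply text.

    Assumes an OpenAI-compatible response body.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Generous timeout: "Thinking" responses can take a long time.
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Explain quantum computing."))` should print the reply. Keeping the payload in its own function makes it easy to check that the stop tokens are always included.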
Hardware Requirements
- Full GPU Offloading (`-ngl 99`): Requires ~130GB VRAM for Q4_K_M (e.g., 2x A100 80GB or Mac Studio Ultra).
- Split Offloading: For single A100 (80GB) cards, use Q2_K or IQ2_XXS and set `-ngl 40` (adjust based on available VRAM) to split the model between GPU and system RAM.
Default Settings (Most Tasks)
- temperature: `1.0`
- top-p: `0.95`
- max new tokens: `131072`
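For split offloading, a starting value for `-ngl` can be estimated with simple arithmetic before tuning it by hand. The sketch below is a rough heuristic only: it assumes every layer takes an equal share of the file size and that a fixed amount of VRAM is kept free for the KV cache and buffers. The function name `estimate_ngl` and every number in the example call are hypothetical; substitute the file size and layer count of the quantization you downloaded.

```python
def estimate_ngl(model_gb: float, n_layers: int, vram_gb: float,
                 reserve_gb: float = 8.0) -> int:
    """Estimate how many layers fit on the GPU.

    Assumes layers are equally sized and that `reserve_gb` of VRAM is
    left free for the KV cache and scratch buffers (both are
    simplifications, not measured values).
    """
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    layers_that_fit = int(usable_gb * n_layers / model_gb)
    # Clamp to the valid range [0, n_layers].
    return max(0, min(n_layers, layers_that_fit))


# Hypothetical numbers: a 100 GB file with 50 layers on a 42 GB card,
# keeping 2 GB in reserve.
print(estimate_ngl(model_gb=100.0, n_layers=50, vram_gb=42.0, reserve_gb=2.0))  # 20
```

Treat the result as an upper bound to start from, and lower it if loading fails with an out-of-memory error.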
For multi-turn agentic tasks (τ²-Bench and Terminal Bench 2), please turn on Preserved Thinking mode.
CLI Example
./llama-cli -m GLM-4.7.Q4_K_M.gguf \
-c 8192 \
--temp 1.0 \
--top-p 0.95 \
-p "[gMASK]<sop><|system|>\nYou are a helpful assistant.<|user|>\nWrite a Python script to calculate Fibonacci numbers.<|assistant|>\n<think>" \
-cnv
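The raw prompt in the command above follows a fixed layout, so it can be assembled programmatically when scripting many runs. The sketch below copies that layout as-is from the CLI example; it has not been checked against the model's official chat template, and the function name `build_glm_prompt` is hypothetical.

```python
def build_glm_prompt(user: str,
                     system: str = "You are a helpful assistant.",
                     open_think: bool = True) -> str:
    """Assemble a single-turn raw prompt in the layout used above.

    Layout (taken from the CLI example, not from an official spec):
    [gMASK]<sop><|system|>\\n{system}<|user|>\\n{user}<|assistant|>\\n<think>
    """
    prompt = f"[gMASK]<sop><|system|>\n{system}<|user|>\n{user}<|assistant|>\n"
    if open_think:
        # Ending on "<think>" nudges the model to start with its reasoning.
        prompt += "<think>"
    return prompt


print(build_glm_prompt("Write a Python script to calculate Fibonacci numbers."))
```

Note that the Python string contains real newline characters, whereas the shell command passes the two-character sequence `\n` and relies on the CLI to interpret it.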
Model tree for AaryanK/GLM-4.7-GGUF
Base model
zai-org/GLM-4.7