Instructions to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound", filename="GLM-4.6-q2_k_s-00001-of-00015.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S # Run inference directly in the terminal: llama-cli -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S # Run inference directly in the terminal: llama-cli -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S # Run inference directly in the terminal: ./llama-cli -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S # Run inference directly in the terminal: ./build/bin/llama-cli -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
Use Docker
docker model run hf.co/INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
- LM Studio
- Jan
- Ollama
How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with Ollama:
ollama run hf.co/INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
- Unsloth Studio new
How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound to start chatting
- Pi new
How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
Run Hermes
hermes
- Docker Model Runner
How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with Docker Model Runner:
docker model run hf.co/INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
- Lemonade
How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
Run and chat with the model
lemonade run user.GLM-4.6-gguf-q2ks-mixed-AutoRound-Q2_K_S
List all available models
lemonade list
Model Details
This model is a mixed gguf q2ks format of zai-org/GLM-4.6 generated by intel/auto-round algorithm. Embedding layer and lm-head layer are fallback to 8 bits and non expert layers are fallback to 4 bits. Please refer to Section Generate the model for more details.
Please follow the license of the original model.
How To Use
Llamacpp inference
/llama-cli -hf Intel/GLM-4.6-gguf-q2ks-mixed-AutoRound
> Write a quick sort algorithm.
<think>The user asks: "Write a quick sort algorithm."\n\nWe need to output a quick sort algorithm. The request is a standard programming request, no disallowed content. So we can comply.\n\nWe need to produce code for quick sort. It could be in various language: Python, Java, C, etc. We can choose a language. Usually, quick sort is a classic algorithm.\n\nWe can provide a straightforward implementation with recursion. The user didn\'t specify language, but we can provide Python implementation (most common). Also we can discuss the algorithm.\n\nGiven the request, we can comply and produce a code snippet.\n\nWe might also include explanation.\n\nThus answer: Provide quick sort algorithm in Python.\n\nWill also include typical implementation with in-place partition.\n\nWe can also discuss complexity.\n\nThus we will answer accordingly.\n\nWe must not mention policy or guidelines.\n\nHence produce code and explanation.\n\nPotentially also provide alternative implementation.\n\nThus final answer will be a quick sort algorithm.\n\nWe might also provide a version for generic types.\n\nLet\'s produce Python code:\n\n```\ndef quicksort(arr):\n if len(arr) <= 1:\n return arr\n pivot = arr[0]\n less = [x for x in arr[1:] if x <= pivot]\n greater = [x for x in arr[1:] if x > pivot]\n return quicksort(less) + [pivot] + quicksort(greater)\n```\n\nAlternatively, in-place version:\n\n```\ndef quicksort_inplace(arr, low, high):\n if low < high:\n pi = partition(arr, low, high)\n quicksort_inplace(arr, low, pi-1)\n quicksort_inplace(arr, pi+1, high)\n```\n\nWe can also provide partition function.\n\nThus final answer.\n\nGiven the instruction to be brief but thorough, we can provide both.\n\nThus answer: Provide quick sort algorithm.\n\nWe must not output any disallowed content.\n\nThus final answer is safe.\n\nNow produce the final.</think>
Below is a simple, self‑contained implementation of the Quick‑Sort algorithm in Python. Two versions are shown:
* **A functional (non‑in‑place) version** that builds new lists for the recursive calls.
* **An in‑place version** that sorts the original list by swapping elements around a pivot.
Both have the same expected time‑complexity: *O(n log n)* on average (and *O(n²)* in the worst case, e.g., when the pivot is always the smallest or largest element).
---
## 1️ ⃣ Functional (non‑in‑place) Quick‑Sort
```python
def quicksort(arr):
"""
Return a new list containing the elements of `arr` sorted.
This version is not in‑place; it creates new sub‑lists for each recursion.
"""
# Base case: a list of length 0 or 1 is already sorted
if len(arr) <= 1:
return arr
# Choose the first element as the pivot (any element works)
pivot = arr[0]
# Partition the remaining elements into three groups:
# * less – elements <= pivot
# * equal – elements == pivot (excluding the pivot itself)
# * greater– elements > pivot
less = [x for x in arr[1:] if x <= pivot]
greater = [x for x in arr[1:] if x > pivot]
# Recursively sort the partitions and concatenate the results
return quicksort(less) + [pivot] + quicksort(greater)
# Example usage:
if __name__ == "__main__":
unsorted = [33, 10, 55, 71, 29, 3, 18, 42]
print("Original:", unsorted)
print("Sorted:", quicksort(unsorted))
```
**Explanation**
1. **Base case** – If the list has 0 or 1 elements, it’s already sorted.
2. **Pivot selection** – Here we simply pick the first element; you could also use a random element or “median‑of‑three” for better performance on pathological inputs.
3. **Partition** – Two list comprehensions split the remaining items into those ≤ pivot and those > pivot.
4. **Recursive sorting** – The two sub‑lists are sorted recursively, then combined with the pivot to produce the final sorted list.
---
## 2️ ⃣ In‑Place Quick‑Sort (Lomuto partition scheme)
```python
def quicksort_inplace(arr, low=0, high=None):
"""
Sort `arr` in place using Quick‑Sort.
Parameters:
arr – list of comparable items.
low – starting index of the sub‑array to sort.
high – ending index (inclusive) of the sub‑array; if None, uses len(arr)-1.
"""
if high is None:
high = len(arr) - 1
# Continue only if there are at least two elements to sort
if low < high:
# Partition the array and obtain the pivot's final position
pivot_index = partition(arr, low, high)
# Recursively sort the sub‑arrays left and right of the pivot
quicksort_inplace(arr, low, pivot_index - 1)
quicksort_inplace(arr, pivot_index + 1, high)
def partition(arr, low, high):
"""
Lomuto partition: choose the element at `high` as pivot,
then rearrange the sub‑array so that all elements <= pivot are left
and all > pivot are right. Returns the final index of the pivot.
"""
pivot = arr[high] # pivot value
i = low - 1 # index of the smaller element
for j in range(low, high):
if arr[j] <= pivot:
i += 1
arr[i], arr[j] = arr[j], arr[i] # swap
# Place pivot after the last smaller element
arr[i + 1], arr[high] = arr[high], arr[i + 1]
return i + 1
# Example usage:
if __name__ == "__main__":
data = [33, 10, 55, 71, 29, 3, 18, 42]
print("Original:", data)
quicksort_inplace(data)
print("Sorted:", data)
```
**Explanation**
* **`partition`** – The Lomuto scheme picks the last element as the pivot. It walks the sub‑array, moving any element ≤ pivot to the front. After the loop, the pivot is swapped into its correct final position, and its index (`i+1`) is returned.
* **Recursion** – The list is sorted recursively on the two halves separated by the pivot.
---
## 3️ ⃣ Tips & Variations
| Variation | Why use it? |
|-----------|-------------|
| **Random pivot** (`pivot = arr[random.randint(low, high)]`) | Reduces chance of worst‑case *O(n²)* on already sorted inputs. |
| **Median‑of‑three pivot** (choose median of first, middle, last) | Improves partition balance for many real‑world datasets. |
| **Tail‑call optimization** (recurse on the smaller side, loop on the larger) | Keeps recursion depth ≤ log n, preventing stack overflow on large lists. |
| **Hybrid with Insertion Sort** (switch to insertion sort for sub‑arrays ≤ 10 elements) | Quick‑Sort overhead is high for tiny partitions; insertion sort is faster there. |
---
### TL;DR
* Use the **functional version** when you want a concise, immutable‑style implementation or you’re dealing with small data where extra list allocations are cheap.
* Use the **in‑place version** when you need to sort large data without extra memory allocation.
Both achieve the classic Quick‑Sort behavior: fast average performance, elegant divide‑and‑conquer recursion, and a simple pivot‑based partition step. Happy sorting!
Generate the model
Here is the sample command to reproduce the model
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound
model_name = "zai-org/GLM-4.6"
model = AutoModelForCausalLM.from_pretrained(model_name,
device_map="cpu", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
layer_config = {}
for n, m in model.named_modules():
if n == "lm_head" or isinstance(m,torch.nn.Embedding):
layer_config[n] = {"bits": 8}
elif isinstance(m, torch.nn.Linear) and (not "expert" in n or "shared_experts" in n) and n != "lm_head":
layer_config[n] = {"bits": 4}
autoround = AutoRound(model, tokenizer, iters=0, layer_config=layer_config, nsamples=512)
autoround.quantize_and_save("tmp_autoround", format="gguf:q2_k_s")
Ethical Considerations and Limitations
The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.
Therefore, before deploying any applications of the model, developers should perform safety testing.
Caveats and Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
Here are a couple of useful links to learn more about Intel's AI software:
- Intel Neural Compressor link
Disclaimer
The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.
Cite
@article{cheng2023optimize, title={Optimize weight rounding via signed gradient descent for the quantization of llms}, author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi}, journal={arXiv preprint arXiv:2309.05516}, year={2023} }
- Downloads last month
- 31
2-bit
Model tree for INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound
Base model
zai-org/GLM-4.6