Instructions to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound",
	filename="GLM-4.6-q2_k_s-00001-of-00015.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
# Run inference directly in the terminal:
llama-cli -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
# Run inference directly in the terminal:
llama-cli -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
# Run inference directly in the terminal:
./llama-cli -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
# Run inference directly in the terminal:
./build/bin/llama-cli -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S

Use Docker

docker model run hf.co/INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S

LM Studio
Jan
Ollama
How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with Ollama:
```
ollama run hf.co/INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
```

Unsloth Studio new

How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound to start chatting

Pi new

How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S

Run Hermes

hermes

Docker Model Runner
How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with Docker Model Runner:
```
docker model run hf.co/INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S
```

Lemonade

How to use INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound:Q2_K_S

Run and chat with the model

lemonade run user.GLM-4.6-gguf-q2ks-mixed-AutoRound-Q2_K_S

List all available models

lemonade list

Model Details

This model is a mixed gguf q2ks format of zai-org/GLM-4.6 generated by intel/auto-round algorithm. Embedding layer and lm-head layer are fallback to 8 bits and non expert layers are fallback to 4 bits. Please refer to Section Generate the model for more details.

Please follow the license of the original model.

How To Use

Llamacpp inference

/llama-cli -hf Intel/GLM-4.6-gguf-q2ks-mixed-AutoRound

> Write a quick sort algorithm.
<think>The user asks: "Write a quick sort algorithm."\n\nWe need to output a quick sort algorithm. The request is a standard programming request, no disallowed content. So we can comply.\n\nWe need to produce code for quick sort. It could be in various language: Python, Java, C, etc. We can choose a language. Usually, quick sort is a classic algorithm.\n\nWe can provide a straightforward implementation with recursion. The user didn\'t specify language, but we can provide Python implementation (most common). Also we can discuss the algorithm.\n\nGiven the request, we can comply and produce a code snippet.\n\nWe might also include explanation.\n\nThus answer: Provide quick sort algorithm in Python.\n\nWill also include typical implementation with in-place partition.\n\nWe can also discuss complexity.\n\nThus we will answer accordingly.\n\nWe must not mention policy or guidelines.\n\nHence produce code and explanation.\n\nPotentially also provide alternative implementation.\n\nThus final answer will be a quick sort algorithm.\n\nWe might also provide a version for generic types.\n\nLet\'s produce Python code:\n\n```\ndef quicksort(arr):\n    if len(arr) <= 1:\n        return arr\n    pivot = arr[0]\n    less = [x for x in arr[1:] if x <= pivot]\n    greater = [x for x in arr[1:] if x > pivot]\n    return quicksort(less) + [pivot] + quicksort(greater)\n```\n\nAlternatively, in-place version:\n\n```\ndef quicksort_inplace(arr, low, high):\n    if low < high:\n        pi = partition(arr, low, high)\n        quicksort_inplace(arr, low, pi-1)\n        quicksort_inplace(arr, pi+1, high)\n```\n\nWe can also provide partition function.\n\nThus final answer.\n\nGiven the instruction to be brief but thorough, we can provide both.\n\nThus answer: Provide quick sort algorithm.\n\nWe must not output any disallowed content.\n\nThus final answer is safe.\n\nNow produce the final.</think>
Below is a simple, self‑contained implementation of the Quick‑Sort algorithm in Python. Two versions are shown:

* **A functional (non‑in‑place) version** that builds new lists for the recursive calls.
* **An in‑place version** that sorts the original list by swapping elements around a pivot.

Both have the same expected time‑complexity: *O(n log n)* on average (and *O(n²)* in the worst case, e.g., when the pivot is always the smallest or largest element).

---

## 1️ ⃣ Functional (non‑in‑place) Quick‑Sort

```python
def quicksort(arr):
    """
    Return a new list containing the elements of `arr` sorted.
    This version is not in‑place; it creates new sub‑lists for each recursion.
    """
    # Base case: a list of length 0 or 1 is already sorted
    if len(arr) <= 1:
        return arr

    # Choose the first element as the pivot (any element works)
    pivot = arr[0]

    # Partition the remaining elements into three groups:
    #   * less   – elements <= pivot
    #   * equal  – elements == pivot (excluding the pivot itself)
    #   * greater– elements > pivot
    less = [x for x in arr[1:] if x <= pivot]
    greater = [x for x in arr[1:] if x > pivot]

    # Recursively sort the partitions and concatenate the results
    return quicksort(less) + [pivot] + quicksort(greater)


# Example usage:
if __name__ == "__main__":
    unsorted = [33, 10, 55, 71, 29, 3, 18, 42]
    print("Original:", unsorted)
    print("Sorted:", quicksort(unsorted))
```

**Explanation**

1. **Base case** – If the list has 0 or 1 elements, it’s already sorted.
2. **Pivot selection** – Here we simply pick the first element; you could also use a random element or “median‑of‑three” for better performance on pathological inputs.
3. **Partition** – Two list comprehensions split the remaining items into those ≤ pivot and those > pivot.
4. **Recursive sorting** – The two sub‑lists are sorted recursively, then combined with the pivot to produce the final sorted list.

---

## 2️ ⃣ In‑Place Quick‑Sort (Lomuto partition scheme)

```python
def quicksort_inplace(arr, low=0, high=None):
    """
    Sort `arr` in place using Quick‑Sort.
    Parameters:
        arr  – list of comparable items.
        low  – starting index of the sub‑array to sort.
        high – ending index (inclusive) of the sub‑array; if None, uses len(arr)-1.
    """
    if high is None:
        high = len(arr) - 1

    # Continue only if there are at least two elements to sort
    if low < high:
        # Partition the array and obtain the pivot's final position
        pivot_index = partition(arr, low, high)

        # Recursively sort the sub‑arrays left and right of the pivot
        quicksort_inplace(arr, low, pivot_index - 1)
        quicksort_inplace(arr, pivot_index + 1, high)


def partition(arr, low, high):
    """
    Lomuto partition: choose the element at `high` as pivot,
    then rearrange the sub‑array so that all elements <= pivot are left
    and all > pivot are right. Returns the final index of the pivot.
    """
    pivot = arr[high]          # pivot value
    i = low - 1                # index of the smaller element

    for j in range(low, high):
        if arr[j] <= pivot:
            i += 1
            arr[i], arr[j] = arr[j], arr[i]   # swap

    # Place pivot after the last smaller element
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1


# Example usage:
if __name__ == "__main__":
    data = [33, 10, 55, 71, 29, 3, 18, 42]
    print("Original:", data)
    quicksort_inplace(data)
    print("Sorted:", data)
```

**Explanation**

* **`partition`** – The Lomuto scheme picks the last element as the pivot. It walks the sub‑array, moving any element ≤ pivot to the front. After the loop, the pivot is swapped into its correct final position, and its index (`i+1`) is returned.
* **Recursion** – The list is sorted recursively on the two halves separated by the pivot.

---

## 3️ ⃣ Tips & Variations

| Variation | Why use it? |
|-----------|-------------|
| **Random pivot** (`pivot = arr[random.randint(low, high)]`) | Reduces chance of worst‑case *O(n²)* on already sorted inputs. |
| **Median‑of‑three pivot** (choose median of first, middle, last) | Improves partition balance for many real‑world datasets. |
| **Tail‑call optimization** (recurse on the smaller side, loop on the larger) | Keeps recursion depth ≤ log n, preventing stack overflow on large lists. |
| **Hybrid with Insertion Sort** (switch to insertion sort for sub‑arrays ≤ 10 elements) | Quick‑Sort overhead is high for tiny partitions; insertion sort is faster there. |

---

### TL;DR

* Use the **functional version** when you want a concise, immutable‑style implementation or you’re dealing with small data where extra list allocations are cheap.
* Use the **in‑place version** when you need to sort large data without extra memory allocation.

Both achieve the classic Quick‑Sort behavior: fast average performance, elegant divide‑and‑conquer recursion, and a simple pivot‑based partition step. Happy sorting!

Generate the model

Here is the sample command to reproduce the model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "zai-org/GLM-4.6"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="cpu", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
layer_config = {}
for n, m in model.named_modules():
    if n == "lm_head" or isinstance(m,torch.nn.Embedding):
        layer_config[n] = {"bits": 8}
    elif isinstance(m, torch.nn.Linear) and (not "expert" in n or "shared_experts" in n) and n != "lm_head":
        layer_config[n] = {"bits": 4}

autoround = AutoRound(model, tokenizer, iters=0, layer_config=layer_config, nsamples=512)
autoround.quantize_and_save("tmp_autoround", format="gguf:q2_k_s")

Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Here are a couple of useful links to learn more about Intel's AI software:

Intel Neural Compressor link

Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

Cite

@article{cheng2023optimize, title={Optimize weight rounding via signed gradient descent for the quantization of llms}, author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi}, journal={arXiv preprint arXiv:2309.05516}, year={2023} }

arxiv github

Downloads last month: 31

GGUF

Model size

357B params

Architecture

glm4moe

Hardware compatibility

2-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound

Base model

zai-org/GLM-4.6

Quantized

(40)

this model

Paper for INC4AI/GLM-4.6-gguf-q2ks-mixed-AutoRound

Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs

Paper • 2309.05516 • Published Sep 11, 2023 • 12