
GigaChat-20B-A3B-instruct

A conversational model from the GigaChat family, based on GigaChat-20B-A3B-base. It supports a context of 131 thousand tokens.

This repository contains the instruction-tuned model from "GigaChat Family: Efficient Russian Language Modeling Through Mixture of Experts Architecture".

More details are available in the Habr article.

Weights in bf16 and int8 are also available for this model.

Update: the weights have been re-uploaded in .safetensors format.

Benchmarks

	T-lite-instruct-0.1 (llama 3.0 8B based)	gemma-2-9b-it	GigaChat-20B-A3B-instruct
MERA	0.335	0.392	0.513
ru-MMLU 5-shot	0.555	0.625	0.598
Shlepa	0.36	0.388	0.482

GigaChat family

	GigaChat-20B-A3B-instruct	GigaChat-Pro v26.20	GigaChat-Max v26.20
Math problems
GSM8K 5-shot	0.763	0.782	0.929
MATH 4-shot	0.426	0.446	0.53
Code generation
HumanEval 0-shot	0.329	0.439	0.64
MBPP 0-shot	0.385	0.487	0.667
General knowledge
MMLU EN 5-shot	0.648	0.687	0.804
MMLU RU 5-shot (data translated from MMLU EN 5-shot)	0.598	0.645	0.75
MMLU RU 1-shot	-	0.617	0.718
MMLU PRO EN 5-shot	0.348	0.431	0.589
RUBQ 0-shot	0.675	0.724	0.73
WINOGRANDE 4-shot	0.75	0.796	0.832
CyberMetric 0-shot	0.798	0.827	0.864
Instruction following
IFEval 0-shot	0.411	0.566	0.721

Measurement details

GSM8k is a test of how well models solve numerical word problems. In our evaluation we used 5 shots and took the last number in the response as the model's answer. In the original test, the answer is located by the pattern '### number'.
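The "last number in the response" rule can be sketched as below. This is only an illustration under stated assumptions, not the authors' evaluation code: the regular expression, the handling of thousands separators, and the helper names `extract_last_number` and `is_correct` are all hypothetical.

```python
import re

# Assumed number format: optional minus sign, digits with optional
# thousands separators, optional decimal part.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_last_number(response: str):
    """Return the last number found in the response as a float, or None."""
    matches = NUMBER_RE.findall(response)
    if not matches:
        return None
    # Drop thousands separators (and any trailing comma) before converting.
    return float(matches[-1].replace(",", ""))


def is_correct(response: str, reference: float, tol: float = 1e-6) -> bool:
    """Compare the last number in the response with the reference answer."""
    predicted = extract_last_number(response)
    return predicted is not None and abs(predicted - reference) <= tol
```

Unlike a strict pattern match, this rule still scores a response that never prints the answer marker, at the cost of occasionally picking up a trailing number that is not the final answer.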

The MATH test also exists in several versions that assess the mathematical abilities of models. In our evaluation we provided 4 examples and took the last expression in the format '\boxed{expression}'. We then checked the results for a match using the sympy library.
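Pulling out the last '\boxed{expression}' takes brace matching rather than a plain regular expression, because the expression may itself contain braces. The sketch below shows one way to do it; the helper name `extract_last_boxed` is hypothetical, and the sympy comparison step mentioned above is deliberately left out.

```python
def extract_last_boxed(response: str):
    """Return the contents of the last \\boxed{...} in the response, or None."""
    marker = r"\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 0  # nesting level of braces opened inside the boxed expression
    for i in range(begin, len(response)):
        ch = response[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                # This brace closes the \boxed{ itself.
                return response[begin:i]
            depth -= 1
    return None  # unbalanced braces: no complete boxed expression
```

The extracted string would then be parsed and compared against the reference answer, which the evaluation described above does with sympy.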

Requirements

transformers>=4.47

Usage example with transformers

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "ai-sage/GigaChat-20B-A3B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Load the weights in bf16 and let device_map place them on the available devices
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto")
# Use the generation settings shipped with the model repository
model.generation_config = GenerationConfig.from_pretrained(model_name)

# The prompt is Russian for "Prove the fixed-point theorem"
messages = [
    {"role": "user", "content": "Докажи теорему о неподвижной точке"}
]
# Tokenize directly through the chat template
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device))

# Decode only the newly generated tokens, skipping the prompt
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=False)
print(result)

Usage example with vLLM

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "ai-sage/GigaChat-20B-A3B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(model=model_name, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=8192)

# One conversation per entry; the prompt is Russian for "Prove the fixed-point theorem"
messages_list = [
    [{"role": "user", "content": "Докажи теорему о неподвижной точке"}],
]

# Tokenize through the chat template so that vLLM receives token ids, not text
prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

# Token ids are passed as prompt dicts; the separate prompt_token_ids= keyword is deprecated in recent vLLM releases
prompts = [{"prompt_token_ids": ids} for ids in prompt_token_ids]
outputs = llm.generate(prompts, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)

GigaChat-20B-A3B-instruct uses a special text tokenization scheme, so the following scenario is not recommended:

# Not recommended: rendering the template to a string and tokenizing it separately
input_string = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_tensor = tokenizer(input_string, return_tensors="pt")

Example of using the vLLM server

Starting the server

vllm serve ai-sage/GigaChat-20B-A3B-instruct \
  --disable-log-requests \
  --trust-remote-code \
  --dtype bfloat16 \
  --max-model-len 8192

Example request

# The system prompt is Russian for "You are a VERY smart mathematician";
# the user prompt is "Prove the fixed-point theorem"
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "ai-sage/GigaChat-20B-A3B-instruct",
        "messages": [
            {"role": "system", "content": "Ты ОЧЕНЬ умный математик"},
            {"role": "user", "content": "Докажи теорему о неподвижной точке"}
        ]
    }'
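
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server from the previous step is listening on localhost:8000. The helper names `build_payload` and `chat` and the 600-second timeout are illustrative choices, not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"  # assumed local server
MODEL = "ai-sage/GigaChat-20B-A3B-instruct"


def build_payload(user_text, system_text=None):
    """Build the JSON body for an OpenAI-compatible chat completion request."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    return {"model": MODEL, "messages": messages}


def chat(user_text, system_text=None, url=BASE_URL, timeout=600):
    """Send one chat request and return the text of the first choice."""
    body = json.dumps(build_payload(user_text, system_text)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]


# Example call (requires the running server):
# print(chat("Докажи теорему о неподвижной точке", "Ты ОЧЕНЬ умный математик"))
```

Keeping `build_payload` separate from the network call makes it possible to inspect the request body without a running server.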