# ruGPT-3 XL (HuggingFace format)

A 1.3B-parameter GPT-3-style language model for Russian, converted from the original ai-forever/rugpt3xl Megatron-LM checkpoint into a native HuggingFace `transformers` format.

This is a base (pretrained) model, not instruction-tuned. It performs text completion and can be fine-tuned for downstream tasks.

For more details, see the paper "A family of pretrained transformer language models for Russian".
## Model Details
| Parameter | Value |
|---|---|
| Parameters | 1.3B |
| Architecture | GPT-3 (decoder-only transformer) |
| Hidden size | 2048 |
| Layers | 24 |
| Attention heads | 16 |
| FFN intermediate size | 8192 |
| Max sequence length | 2048 |
| Vocabulary | 50,264 tokens (BPE) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Position encoding | Learned absolute |
| Precision | float16 |
| Training data | 80B tokens of Russian text (4 epochs) |
| Test perplexity | 12.05 |
## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "evilfreelancer/ruGPT3XL"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)

inputs = tokenizer("Москва - столица", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Loading Options

GPU (float16, recommended):

```python
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
```

CPU (float32):

```python
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, dtype=torch.float32, device_map="cpu"
)
```
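When choosing between these options, a back-of-envelope weight-memory estimate helps; the ~1.4B total parameter count used here is taken from this model card (weights only, excluding activations, KV cache, and optimizer state):

```python
# Rough weight memory for each loading option (weights only).
params = 1.4e9                  # approximate total parameter count
gib = 1024 ** 3
fp16 = params * 2 / gib         # 2 bytes per parameter -> ~2.6 GiB
fp32 = params * 4 / gib         # 4 bytes per parameter -> ~5.2 GiB
print(f"float16: ~{fp16:.1f} GiB, float32: ~{fp32:.1f} GiB")
```

So the float16 checkpoint comfortably fits consumer GPUs with 6+ GB of VRAM, while float32 inference on CPU needs roughly twice the RAM.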
## Chat Template

The tokenizer includes a simple chat template for question answering:

```python
messages = [
    {"role": "user", "content": "Какая столица России?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Output: "Вопрос: Какая столица России?\n\nОтвет: "

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Note:** This is a base model, not an instruction-tuned chatbot. The chat template provides a basic structure, but the model may not always follow instructions precisely. For reliable conversational behavior, fine-tune the model on instruction/chat data.
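For clarity, the template's output format can be approximated in plain Python. This is an illustrative re-implementation of the single-turn "Вопрос/Ответ" ("Question/Answer") structure shown in the comment above, not the tokenizer's actual Jinja template, which remains authoritative:

```python
def render_chat(messages, add_generation_prompt=True):
    """Approximate the simple Q/A chat format described above (assumption:
    user turns become 'Вопрос:', assistant turns become 'Ответ:')."""
    parts = []
    for m in messages:
        if m["role"] == "user":
            parts.append(f"Вопрос: {m['content']}\n\n")
        elif m["role"] == "assistant":
            parts.append(f"Ответ: {m['content']}\n\n")
    if add_generation_prompt:
        parts.append("Ответ: ")
    return "".join(parts)
```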
## Fine-tuning

The model is fully compatible with standard HuggingFace training workflows.

### Full Fine-tuning with Trainer

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

args = TrainingArguments(
    output_dir="./rugpt3xl-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    fp16=True,
    save_strategy="epoch",
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=your_dataset,  # dataset with input_ids, attention_mask, labels
)
trainer.train()
```
### LoRA Fine-tuning with PEFT

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~14M || all params: 1.4B || trainable%: ~1.0%
```
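The ~14M figure follows directly from the LoRA shapes: an adapted Linear layer of shape `(out, in)` adds `r * (in + out)` parameters. A back-of-envelope check using the layer dimensions from the Model Details table (simple arithmetic, not PEFT output):

```python
r, hidden, ffn, layers = 16, 2048, 8192, 24

attn = 4 * r * (hidden + hidden)                # q/k/v/o_proj: each 2048 -> 2048
mlp = r * (hidden + ffn) + r * (ffn + hidden)   # up_proj (2048->8192), down_proj (8192->2048)
total = layers * (attn + mlp)
print(total)  # 14155776, i.e. ~14M trainable parameters
```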
### SFT with TRL

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, TaskType
from datasets import Dataset

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Dataset in chat-messages format
train_data = [
    {"messages": [
        {"role": "user", "content": "Какая столица России?"},
        {"role": "assistant", "content": "Москва - столица Российской Федерации."},
    ]},
    # ... more examples
]
dataset = Dataset.from_list(train_data)

sft_config = SFTConfig(
    output_dir="./rugpt3xl-sft",
    max_steps=1000,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=10,
    max_length=512,
)
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    peft_config=lora_config,
    processing_class=tokenizer,
)
trainer.train()
```
### Supported Fine-tuning Features

| Feature | Status |
|---|---|
| Full parameter training | Supported |
| Gradient checkpointing | Supported |
| LoRA / PEFT | Supported |
| TRL `SFTTrainer` | Supported |
| DeepSpeed ZeRO | Supported |
| FSDP | Supported |
| KV cache during generation | Supported |
| `labels` argument for loss computation | Supported |

LoRA target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `up_proj`, `down_proj`
## Architecture Details

The model implements a custom `RuGPT3XLForCausalLM` class (loaded via `trust_remote_code=True`):

```
RuGPT3XLForCausalLM
├── model (RuGPT3XLModel)
│   ├── embed_tokens (Embedding: 50264 x 2048)
│   ├── embed_positions (Embedding: 2048 x 2048)
│   ├── embed_dropout (Dropout: 0.1)
│   ├── layers (x24) (RuGPT3XLDecoderLayer)
│   │   ├── input_layernorm (LayerNorm: 2048)
│   │   ├── self_attn (RuGPT3XLAttention)
│   │   │   ├── q_proj (Linear: 2048 -> 2048)
│   │   │   ├── k_proj (Linear: 2048 -> 2048)
│   │   │   ├── v_proj (Linear: 2048 -> 2048)
│   │   │   ├── o_proj (Linear: 2048 -> 2048)
│   │   │   ├── attn_dropout (Dropout: 0.1)
│   │   │   └── resid_dropout (Dropout: 0.1)
│   │   ├── post_attention_layernorm (LayerNorm: 2048)
│   │   └── mlp (RuGPT3XLMLP)
│   │       ├── up_proj (Linear: 2048 -> 8192)
│   │       ├── down_proj (Linear: 8192 -> 2048)
│   │       ├── act_fn (GELU)
│   │       └── dropout (Dropout: 0.1)
│   └── norm (LayerNorm: 2048)
└── lm_head (Linear: 2048 -> 50264, no bias)
```
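The parameter count can be sanity-checked against these dimensions. Counting weight matrices only (biases and LayerNorm parameters add a small remainder), and assuming the output head is untied from the token embeddings, consistent with the ~1.4B total reported by PEFT:

```python
vocab, hidden, layers, ffn, max_pos = 50264, 2048, 24, 8192, 2048

embeddings = vocab * hidden + max_pos * hidden      # token + position embeddings
per_layer = 4 * hidden * hidden + 2 * hidden * ffn  # q/k/v/o_proj + up/down_proj
lm_head = hidden * vocab                            # output head (no bias)
total = embeddings + layers * per_layer + lm_head
print(f"{total / 1e9:.2f}B")  # 1.42B
```

The transformer body alone (excluding the output head) comes to ~1.3B, matching the headline figure in the table above.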
## Conversion

This model was converted from the original Megatron-LM checkpoint using a custom script. The conversion performs the following transformations:

- Strips the `module.` prefix from parameter names (FP16 / DDP wrappers)
- Remaps Megatron-LM naming to the HuggingFace convention
- Splits the fused QKV projection (`[6144, 2048]`) into separate Q, K, V (`[2048, 2048]` each)
- Saves weights in safetensors format
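The QKV split can be sketched with NumPy. The exact row layout of the fused weight is an assumption here (per-head interleaved Q/K/V, a common Megatron-LM convention); the conversion script itself is authoritative:

```python
import numpy as np

num_heads, head_dim, hidden = 16, 128, 2048  # 16 heads * 128 dims = 2048
fused = np.zeros((3 * hidden, hidden), dtype=np.float16)  # [6144, 2048] fused QKV

# Assumed layout: rows grouped per head, with each head's Q, K, V slices adjacent.
w = fused.reshape(num_heads, 3, head_dim, hidden)
q = w[:, 0].reshape(hidden, hidden)  # [2048, 2048]
k = w[:, 1].reshape(hidden, hidden)  # [2048, 2048]
v = w[:, 2].reshape(hidden, hidden)  # [2048, 2048]
```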
For full conversion details and the script, see the rugpt3xl-convert repository.
## Limitations
- This is a base model trained on Russian internet text. It may generate biased, factually incorrect, or offensive content.
- The model was trained primarily on Russian text. It has limited capability in other languages.
- Maximum context length is 2048 tokens. Inputs longer than this will be truncated.
- The model is not instruction-tuned and works best for text completion rather than following specific instructions.
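When an input exceeds the 2048-token window, it has to be truncated or processed in pieces. A minimal sliding-window chunker over token ids (an illustrative helper, not part of the model's API):

```python
def chunk_token_ids(ids, max_len=2048, stride=256):
    """Split a token-id sequence into windows of at most max_len,
    overlapping by `stride` tokens to preserve some context."""
    if len(ids) <= max_len:
        return [list(ids)]
    chunks, start = [], 0
    while start < len(ids):
        chunks.append(list(ids[start:start + max_len]))
        if start + max_len >= len(ids):
            break
        start += max_len - stride
    return chunks
```

Each chunk can then be fed to the model independently; for perplexity-style evaluation, only the non-overlapping tail of each window should be scored.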
## Citation

```bibtex
@misc{rugpt3xl,
  title={ruGPT-3 XL},
  author={SberDevices Team},
  year={2021},
  publisher={Hugging Face},
  url={https://huggingface.co/ai-forever/rugpt3xl}
}
```
## Links

- "A family of pretrained transformer language models for Russian": paper (on Google Scholar)
- ai-forever/rugpt3xl: original model
- ai-forever/ru-gpts: original training codebase