# ruGPT-3 XL (HuggingFace format)

A 1.3B-parameter GPT-3-style language model for Russian, converted from the original ai-forever/rugpt3xl Megatron-LM checkpoint into a native HuggingFace `transformers` format.

This is a base (pretrained) model, not instruction-tuned. It performs text completion and can be fine-tuned for downstream tasks.

For more details, see the paper "A family of pretrained transformer language models for Russian".
## Model Details
| Parameter | Value |
|---|---|
| Parameters | 1.3B |
| Architecture | GPT-3 (decoder-only transformer) |
| Hidden size | 2048 |
| Layers | 24 |
| Attention heads | 16 |
| FFN intermediate size | 8192 |
| Max sequence length | 2048 |
| Vocabulary | 50,264 tokens (BPE) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Position encoding | Learned absolute |
| Precision | float16 |
| Training data | 80B tokens of Russian text (4 epochs) |
| Test perplexity | 12.05 |
## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "evilfreelancer/ruGPT3XL"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)

inputs = tokenizer("Москва - столица", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Loading Options

GPU (float16, recommended):

```python
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
```

CPU (float32):

```python
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, dtype=torch.float32, device_map="cpu"
)
```
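When choosing between these options, a back-of-envelope weight-memory estimate helps; the ~1.4B total parameter count used here is taken from this model card (weights only, excluding activations, KV cache, and optimizer state):

```python
# Rough weight memory for each loading option (weights only).
params = 1.4e9                  # approximate total parameter count
gib = 1024 ** 3
fp16 = params * 2 / gib         # 2 bytes per parameter -> ~2.6 GiB
fp32 = params * 4 / gib         # 4 bytes per parameter -> ~5.2 GiB
print(f"float16: ~{fp16:.1f} GiB, float32: ~{fp32:.1f} GiB")
```

So the float16 checkpoint comfortably fits consumer GPUs with 6+ GB of VRAM, while float32 inference on CPU needs roughly twice the RAM.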
## Chat Template

The tokenizer includes a simple chat template for question answering:

```python
messages = [
    {"role": "user", "content": "Какая столица России?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Output: "Вопрос: Какая столица России?\n\nОтвет: "

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

**Note:** This is a base model, not an instruction-tuned chatbot. The chat template provides a basic structure, but the model may not always follow instructions precisely. For reliable conversational behavior, fine-tune the model on instruction/chat data.
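For clarity, the template's output format can be approximated in plain Python. This is an illustrative re-implementation of the single-turn "Вопрос/Ответ" ("Question/Answer") structure shown in the comment above, not the tokenizer's actual Jinja template, which remains authoritative:

```python
def render_chat(messages, add_generation_prompt=True):
    """Approximate the simple Q/A chat format described above (assumption:
    user turns become 'Вопрос:', assistant turns become 'Ответ:')."""
    parts = []
    for m in messages:
        if m["role"] == "user":
            parts.append(f"Вопрос: {m['content']}\n\n")
        elif m["role"] == "assistant":
            parts.append(f"Ответ: {m['content']}\n\n")
    if add_generation_prompt:
        parts.append("Ответ: ")
    return "".join(parts)
```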
## Fine-tuning

The model is fully compatible with standard HuggingFace training workflows.

### Full Fine-tuning with Trainer

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

args = TrainingArguments(
    output_dir="./rugpt3xl-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    fp16=True,
    save_strategy="epoch",
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=your_dataset,  # dataset with input_ids, attention_mask, labels
)
trainer.train()
```
### LoRA Fine-tuning with PEFT

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~14M || all params: 1.4B || trainable%: ~1.0%
```
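The ~14M figure follows directly from the LoRA shapes: an adapted Linear layer of shape `(out, in)` adds `r * (in + out)` parameters. A back-of-envelope check using the layer dimensions from the Model Details table (simple arithmetic, not PEFT output):

```python
r, hidden, ffn, layers = 16, 2048, 8192, 24

attn = 4 * r * (hidden + hidden)                # q/k/v/o_proj: each 2048 -> 2048
mlp = r * (hidden + ffn) + r * (ffn + hidden)   # up_proj (2048->8192), down_proj (8192->2048)
total = layers * (attn + mlp)
print(total)  # 14155776, i.e. ~14M trainable parameters
```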
### SFT with TRL

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, TaskType
from datasets import Dataset

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Dataset in chat-messages format
train_data = [
    {"messages": [
        {"role": "user", "content": "Какая столица России?"},
        {"role": "assistant", "content": "Москва - столица Российской Федерации."},
    ]},
    # ... more examples
]
dataset = Dataset.from_list(train_data)

sft_config = SFTConfig(
    output_dir="./rugpt3xl-sft",
    max_steps=1000,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=10,
    max_length=512,
)
trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    peft_config=lora_config,
    processing_class=tokenizer,
)
trainer.train()
```
### Supported Fine-tuning Features

| Feature | Status |
|---|---|
| Full parameter training | Supported |
| Gradient checkpointing | Supported |
| LoRA / PEFT | Supported |
| TRL `SFTTrainer` | Supported |
| DeepSpeed ZeRO | Supported |
| FSDP | Supported |
| KV cache during generation | Supported |
| `labels` argument for loss computation | Supported |

LoRA target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `up_proj`, `down_proj`
## Architecture Details

The model implements a custom `RuGPT3XLForCausalLM` class (loaded via `trust_remote_code=True`):

```
RuGPT3XLForCausalLM
├── model (RuGPT3XLModel)
│   ├── embed_tokens (Embedding: 50264 x 2048)
│   ├── embed_positions (Embedding: 2048 x 2048)
│   ├── embed_dropout (Dropout: 0.1)
│   ├── layers (x24) (RuGPT3XLDecoderLayer)
│   │   ├── input_layernorm (LayerNorm: 2048)
│   │   ├── self_attn (RuGPT3XLAttention)
│   │   │   ├── q_proj (Linear: 2048 -> 2048)
│   │   │   ├── k_proj (Linear: 2048 -> 2048)
│   │   │   ├── v_proj (Linear: 2048 -> 2048)
│   │   │   ├── o_proj (Linear: 2048 -> 2048)
│   │   │   ├── attn_dropout (Dropout: 0.1)
│   │   │   └── resid_dropout (Dropout: 0.1)
│   │   ├── post_attention_layernorm (LayerNorm: 2048)
│   │   └── mlp (RuGPT3XLMLP)
│   │       ├── up_proj (Linear: 2048 -> 8192)
│   │       ├── down_proj (Linear: 8192 -> 2048)
│   │       ├── act_fn (GELU)
│   │       └── dropout (Dropout: 0.1)
│   └── norm (LayerNorm: 2048)
└── lm_head (Linear: 2048 -> 50264, no bias)
```
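The parameter count can be sanity-checked against these dimensions. Counting weight matrices only (biases and LayerNorm parameters add a small remainder), and assuming the output head is untied from the token embeddings, consistent with the ~1.4B total reported by PEFT:

```python
vocab, hidden, layers, ffn, max_pos = 50264, 2048, 24, 8192, 2048

embeddings = vocab * hidden + max_pos * hidden      # token + position embeddings
per_layer = 4 * hidden * hidden + 2 * hidden * ffn  # q/k/v/o_proj + up/down_proj
lm_head = hidden * vocab                            # output head (no bias)
total = embeddings + layers * per_layer + lm_head
print(f"{total / 1e9:.2f}B")  # 1.42B
```

The transformer body alone (excluding the output head) comes to ~1.3B, matching the headline figure in the table above.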
## Conversion

This model was converted from the original Megatron-LM checkpoint using a custom script. The conversion performs the following transformations:

- Strips the `module.` prefix from parameter names (FP16 / DDP wrappers)
- Remaps Megatron-LM naming to the HuggingFace convention
- Splits the fused QKV projection (`[6144, 2048]`) into separate Q, K, V (`[2048, 2048]` each)
- Saves weights in safetensors format
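The QKV split can be sketched with NumPy. The exact row layout of the fused weight is an assumption here (per-head interleaved Q/K/V, a common Megatron-LM convention); the conversion script itself is authoritative:

```python
import numpy as np

num_heads, head_dim, hidden = 16, 128, 2048  # 16 heads * 128 dims = 2048
fused = np.zeros((3 * hidden, hidden), dtype=np.float16)  # [6144, 2048] fused QKV

# Assumed layout: rows grouped per head, with each head's Q, K, V slices adjacent.
w = fused.reshape(num_heads, 3, head_dim, hidden)
q = w[:, 0].reshape(hidden, hidden)  # [2048, 2048]
k = w[:, 1].reshape(hidden, hidden)  # [2048, 2048]
v = w[:, 2].reshape(hidden, hidden)  # [2048, 2048]
```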
For full conversion details and the script, see the rugpt3xl-convert repository.
## Limitations
- This is a base model trained on Russian internet text. It may generate biased, factually incorrect, or offensive content.
- The model was trained primarily on Russian text. It has limited capability in other languages.
- Maximum context length is 2048 tokens. Inputs longer than this will be truncated.
- The model is not instruction-tuned and works best for text completion rather than following specific instructions.
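When an input exceeds the 2048-token window, it has to be truncated or processed in pieces. A minimal sliding-window chunker over token ids (an illustrative helper, not part of the model's API):

```python
def chunk_token_ids(ids, max_len=2048, stride=256):
    """Split a token-id sequence into windows of at most max_len,
    overlapping by `stride` tokens to preserve some context."""
    if len(ids) <= max_len:
        return [list(ids)]
    chunks, start = [], 0
    while start < len(ids):
        chunks.append(list(ids[start:start + max_len]))
        if start + max_len >= len(ids):
            break
        start += max_len - stride
    return chunks
```

Each chunk can then be fed to the model independently; for perplexity-style evaluation, only the non-overlapping tail of each window should be scored.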
## Citation

```bibtex
@misc{rugpt3xl,
  title={ruGPT-3 XL},
  author={SberDevices Team},
  year={2021},
  publisher={Hugging Face},
  url={https://huggingface.co/ai-forever/rugpt3xl}
}
```
## Links

- "A family of pretrained transformer language models for Russian": paper (on Google Scholar)
- ai-forever/rugpt3xl: original model
- ai-forever/ru-gpts: original training codebase