ruGPT-3 XL (HuggingFace format)

A 1.3B-parameter GPT-3-style language model for Russian, converted from the original ai-forever/rugpt3xl Megatron-LM checkpoint into a native HuggingFace transformers format.

This is a base (pretrained) model, not instruction-tuned. It performs text completion and can be fine-tuned for downstream tasks.

For more details, see the paper "A Family of Pretrained Transformer Language Models for Russian".

Model Details

| Parameter | Value |
|---|---|
| Parameters | 1.3B |
| Architecture | GPT-3 (decoder-only transformer) |
| Hidden size | 2048 |
| Layers | 24 |
| Attention heads | 16 |
| FFN intermediate size | 8192 |
| Max sequence length | 2048 |
| Vocabulary | 50,264 tokens (BPE) |
| Activation | GELU |
| Normalization | Pre-LayerNorm |
| Position encoding | Learned absolute |
| Precision | float16 |
| Training data | 80B tokens of Russian text (4 epochs) |
| Test perplexity | 12.05 |

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "evilfreelancer/ruGPT3XL"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)

inputs = tokenizer("Москва - столица", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Loading Options

GPU (float16, recommended):

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)

CPU (float32):

import torch

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, dtype=torch.float32, device_map="cpu"
)
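To pick between these options, a rough back-of-the-envelope estimate of the weight memory helps (illustrative arithmetic only; it ignores activations, the KV cache, and framework overhead):

```python
# Rough weight-memory estimate for a 1.3B-parameter model.
# Back-of-the-envelope only: ignores activations, KV cache, and overhead.
params = 1.3e9

fp16_gb = params * 2 / 1024**3   # 2 bytes per parameter in float16
fp32_gb = params * 4 / 1024**3   # 4 bytes per parameter in float32

print(f"float16 weights: ~{fp16_gb:.1f} GB")  # ~2.4 GB
print(f"float32 weights: ~{fp32_gb:.1f} GB")  # ~4.8 GB
```

In practice, budget extra headroom on top of these figures for generation-time buffers.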

Chat Template

The tokenizer includes a simple chat template for question-answering:

messages = [
    {"role": "user", "content": "Какая столица России?"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Output: "Вопрос: Какая столица России?\n\nОтвет: "

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Note: This is a base model, not an instruction-tuned chatbot. The chat template provides a basic structure, but the model may not always follow instructions precisely. For reliable conversational behavior, fine-tune the model on instruction/chat data.
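The template's rendering can be mimicked with a plain-Python sketch. This is an illustrative re-implementation of the "Вопрос/Ответ" structure shown above, not the actual Jinja template shipped with the tokenizer:

```python
def render_qa_template(messages, add_generation_prompt=True):
    # Illustrative sketch of the question/answer template above; the
    # tokenizer ships its own Jinja template, this only mirrors its output.
    text = ""
    for m in messages:
        if m["role"] == "user":
            text += f"Вопрос: {m['content']}\n\n"
        elif m["role"] == "assistant":
            text += f"Ответ: {m['content']}\n\n"
    if add_generation_prompt:
        text += "Ответ: "
    return text

print(render_qa_template([{"role": "user", "content": "Какая столица России?"}]))
# Вопрос: Какая столица России?
#
# Ответ:
```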

Fine-tuning

The model is fully compatible with standard HuggingFace training workflows.

Full Fine-tuning with Trainer

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

args = TrainingArguments(
    output_dir="./rugpt3xl-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    fp16=True,
    save_strategy="epoch",
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=your_dataset,  # dataset with input_ids, attention_mask, labels
)
trainer.train()
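The dataset passed to Trainer needs input_ids, attention_mask, and labels. A common causal-LM pattern is to copy input_ids into labels and mask padding positions with -100 so they are ignored by the loss; a minimal sketch with made-up token ids (no tokenizer required):

```python
# Minimal sketch of causal-LM label construction: labels are a copy of
# input_ids with padding positions masked to -100 so the cross-entropy
# loss skips them. Token ids here are invented for illustration.
PAD_ID = 0

def build_example(input_ids, max_length=8):
    padded = input_ids + [PAD_ID] * (max_length - len(input_ids))
    attention_mask = [1] * len(input_ids) + [0] * (max_length - len(input_ids))
    labels = [tok if mask == 1 else -100 for tok, mask in zip(padded, attention_mask)]
    return {"input_ids": padded, "attention_mask": attention_mask, "labels": labels}

ex = build_example([101, 2045, 17, 523, 102])
print(ex["labels"])  # [101, 2045, 17, 523, 102, -100, -100, -100]
```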

LoRA Fine-tuning with PEFT

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~14M || all params: 1.4B || trainable%: ~1.0%
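The ~14M figure can be sanity-checked by hand: each LoRA adapter adds two low-rank matrices per wrapped Linear, r x in_features and out_features x r, i.e. r * (in_features + out_features) parameters:

```python
# Sanity check of the "~14M trainable params" figure for the config above
# (r=16, six target modules, 24 layers, hidden=2048, FFN=8192).
r, layers, hidden, ffn = 16, 24, 2048, 8192

attn = 4 * r * (hidden + hidden)               # q_proj, k_proj, v_proj, o_proj
mlp = r * (hidden + ffn) + r * (ffn + hidden)  # up_proj, down_proj
total = layers * (attn + mlp)

print(total)  # 14155776, i.e. ~14.2M, matching print_trainable_parameters()
```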

SFT with TRL

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig, TaskType
from datasets import Dataset

model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Dataset with chat messages format
train_data = [
    {"messages": [
        {"role": "user", "content": "Какая столица России?"},
        {"role": "assistant", "content": "Москва - столица Российской Федерации."},
    ]},
    # ... more examples
]
dataset = Dataset.from_list(train_data)

sft_config = SFTConfig(
    output_dir="./rugpt3xl-sft",
    max_steps=1000,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=10,
    max_length=512,
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=dataset,
    peft_config=lora_config,
    processing_class=tokenizer,
)
trainer.train()

Supported Fine-tuning Features

| Feature | Status |
|---|---|
| Full parameter training | Supported |
| Gradient checkpointing | Supported |
| LoRA / PEFT | Supported |
| TRL SFTTrainer | Supported |
| DeepSpeed ZeRO | Supported |
| FSDP | Supported |
| KV cache during generation | Supported |
| labels argument for loss computation | Supported |

LoRA target modules: q_proj, k_proj, v_proj, o_proj, up_proj, down_proj

Architecture Details

The model implements a custom RuGPT3XLForCausalLM class (loaded via trust_remote_code=True):

RuGPT3XLForCausalLM
  ├── model (RuGPT3XLModel)
  │     ├── embed_tokens       (Embedding: 50264 x 2048)
  │     ├── embed_positions    (Embedding: 2048 x 2048)
  │     ├── embed_dropout      (Dropout: 0.1)
  │     ├── layers (x24)       (RuGPT3XLDecoderLayer)
  │     │     ├── input_layernorm          (LayerNorm: 2048)
  │     │     ├── self_attn                (RuGPT3XLAttention)
  │     │     │     ├── q_proj             (Linear: 2048 -> 2048)
  │     │     │     ├── k_proj             (Linear: 2048 -> 2048)
  │     │     │     ├── v_proj             (Linear: 2048 -> 2048)
  │     │     │     ├── o_proj             (Linear: 2048 -> 2048)
  │     │     │     ├── attn_dropout       (Dropout: 0.1)
  │     │     │     └── resid_dropout      (Dropout: 0.1)
  │     │     ├── post_attention_layernorm (LayerNorm: 2048)
  │     │     └── mlp                      (RuGPT3XLMLP)
  │     │           ├── up_proj            (Linear: 2048 -> 8192)
  │     │           ├── down_proj          (Linear: 8192 -> 2048)
  │     │           ├── act_fn             (GELU)
  │     │           └── dropout            (Dropout: 0.1)
  │     └── norm               (LayerNorm: 2048)
  └── lm_head                  (Linear: 2048 -> 50264, no bias)
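The tree above can be cross-checked by totaling parameters per module. Bias terms for the Linear and LayerNorm modules, and lm_head being counted separately from embed_tokens (untied), are assumptions on our part; with those assumptions the tally lands at ~1.4B, consistent with what PEFT reports above:

```python
# Parameter tally for the module tree above. Assumptions: Linear and
# LayerNorm modules carry biases (lm_head does not), and lm_head is
# counted separately from embed_tokens (i.e. untied).
V, H, F, L = 50264, 2048, 8192, 24

embeddings = V * H + H * H                      # embed_tokens + embed_positions
attn = 4 * (H * H + H)                          # q/k/v/o projections with bias
mlp = (H * F + F) + (F * H + H)                 # up_proj + down_proj with bias
norms = 2 * 2 * H                               # two LayerNorms (weight + bias)
per_layer = attn + mlp + norms

total = embeddings + L * per_layer + 2 * H + V * H  # + final norm + lm_head
print(f"{total:,}")  # 1,418,678,272, i.e. ~1.4B
```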

Conversion

This model was converted from the original Megatron-LM checkpoint using a custom script. The conversion performs the following transformations:

  1. Strips the module. prefix from parameter names (FP16 / DDP wrappers)
  2. Remaps Megatron-LM naming to HuggingFace convention
  3. Splits the fused QKV projection ([6144, 2048]) into separate Q, K, V ([2048, 2048] each)
  4. Saves weights in safetensors format

For full conversion details and the script, see the rugpt3xl-convert repository.
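Step 3 can be sketched with NumPy. We assume Megatron's per-head interleaved layout for the fused weight (rows grouped as [head][q|k|v][head_dim]); consult the actual conversion script for the authoritative ordering:

```python
import numpy as np

# Sketch of splitting a fused Megatron QKV weight [3*H, H] into separate
# Q, K, V matrices [H, H] each. Assumes Megatron's per-head interleaved
# row layout; the real conversion script is the source of truth.
heads, head_dim, hidden = 16, 128, 2048

fused = np.random.randn(3 * hidden, hidden).astype(np.float32)  # [6144, 2048]

w = fused.reshape(heads, 3, head_dim, hidden)
q = w[:, 0].reshape(hidden, hidden)
k = w[:, 1].reshape(hidden, hidden)
v = w[:, 2].reshape(hidden, hidden)

print(q.shape, k.shape, v.shape)  # three [2048, 2048] matrices
```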

Limitations

  • This is a base model trained on Russian internet text. It may generate biased, factually incorrect, or offensive content.
  • The model was trained primarily on Russian text. It has limited capability in other languages.
  • Maximum context length is 2048 tokens. Inputs longer than this will be truncated.
  • The model is not instruction-tuned and works best for text completion rather than following specific instructions.

Citation

@misc{rugpt3xl,
  title={ruGPT-3 XL},
  author={SberDevices Team},
  year={2021},
  publisher={Hugging Face},
  url={https://huggingface.co/ai-forever/rugpt3xl}
}
