smol-llama 🦙 (360M) — from-scratch LLaMA-style pretraining

This repository contains smol-llama, a ~360M parameter LLaMA-architecture causal LM trained from scratch for next-token prediction.
This project was primarily an educational + engineering effort to reproduce a SmolLM-like training setup at a smaller scale.

TL;DR: Wanted to see if it's possible to actually pretrain an LLM from scratch end-to-end — data pipeline → tokenizer → training → checkpoint → Hub.

Model Description

smol-llama is a compact implementation of the LLaMA architecture, featuring modern techniques like Grouped Query Attention (GQA), RoPE embeddings, and SwiGLU activations. It was trained on the weights-and-wires/fineweb-6b dataset.

NOTE: This is an early checkpoint / research-y base model. Expect imperfect generations.


Model Architecture

Component         Value
----------------  -------------------------
Parameters        360M
Hidden Dimension  960
Layers            32
Attention Heads   15 (Query) / 5 (KV)
Head Dimension    64
Context Length    2048
Vocabulary Size   49,152
Architecture      LLaMA-style decoder-only
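
For concreteness, the table above maps onto a small config object along these lines (a hedged sketch; the field names are illustrative guesses, and utils/model.py is authoritative):

from dataclasses import dataclass

@dataclass
class ModelArgs:
  # Field names are illustrative guesses; utils/model.py is authoritative
  dim: int = 960           # hidden dimension (15 heads x 64 head dim)
  n_layers: int = 32
  n_heads: int = 15        # query heads
  n_kv_heads: int = 5      # GQA: 15 / 5 = 3 query heads per KV head
  vocab_size: int = 49152
  max_seq_len: int = 2048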

Key Features:

  • Grouped Query Attention (GQA): 3:1 query-to-KV head ratio for efficient inference (see the sketch after this list)
  • RoPE: Rotary Position Embeddings for better length generalization
  • RMSNorm: Root Mean Square Layer Normalization
  • SwiGLU: Gated linear unit activation in FFN
  • Flash Attention 2: Memory-efficient attention computation
  • Gradient Checkpointing: trades recomputation for activation memory, enabling larger batch sizes
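
To make the GQA ratio concrete, here is a minimal sketch of the standard trick of repeating each KV head so the 5 KV heads line up with the 15 query heads before attention (illustrative only, not the repository's exact code):

import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
  # (batch, seq_len, n_kv_heads, head_dim) -> (batch, seq_len, n_kv_heads * n_rep, head_dim)
  if n_rep == 1:
    return x
  bsz, seqlen, n_kv_heads, head_dim = x.shape
  x = x[:, :, :, None, :].expand(bsz, seqlen, n_kv_heads, n_rep, head_dim)
  return x.reshape(bsz, seqlen, n_kv_heads * n_rep, head_dim)

# Each of the 5 KV heads is shared by 3 of the 15 query heads
keys = torch.randn(1, 2048, 5, 64)
print(repeat_kv(keys, n_rep=3).shape)  # torch.Size([1, 2048, 15, 64])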

Training Details

Dataset

Trained on weights-and-wires/fineweb-6b, a curated subset of the FineWeb dataset containing ~6 billion high-quality web tokens.
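
To poke at the training data without downloading all of it, the dataset can be streamed from the Hub. A small sketch, assuming the split and column names follow FineWeb's layout:

from datasets import load_dataset

# Stream a few documents rather than downloading the full dump
ds = load_dataset("weights-and-wires/fineweb-6b", split="train", streaming=True)
for doc in ds.take(3):
  print(doc["text"][:200])  # the "text" column name is assumed, as in FineWeb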

Training Hyperparameters

Hyperparameter         Value
---------------------  -------------------------
Optimizer              AdamW (fused)
Learning Rate          3e-4 (peak)
LR Schedule            Cosine with linear warmup
Warmup Steps           900
Total Steps            5,725 (~1 epoch)
Batch Size             64 sequences
Gradient Accumulation  8 steps
Effective Batch Size   512 sequences
Context Length         2048 tokens
Tokens per Step        ~1M
Total Tokens           ~6B
Precision              bfloat16
Gradient Clipping      1.0 (max norm)
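
The per-step token count follows directly from the table: 64 sequences × 8 accumulation steps × 2048 tokens = 1,048,576 ≈ 1M tokens, and 5,725 steps × ~1M tokens ≈ 6B. The schedule can be sketched as follows (a minimal illustration with an assumed floor of 0; utils/lr_schedule.py is the reference):

import math

def get_lr(step, peak_lr=3e-4, warmup_steps=900, total_steps=5725, min_lr=0.0):
  # Linear warmup from 0 to the peak over the first 900 steps
  if step < warmup_steps:
    return peak_lr * (step + 1) / warmup_steps
  # Cosine decay from the peak down to min_lr over the remaining steps
  progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
  return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))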

Infrastructure

Resource        Specification
--------------  --------------------------
GPU             1× NVIDIA H100 (80GB PCIe)
Training Time   ~22 hours
Throughput      ~75,000 tokens/sec
Cloud Provider  RunPod
Cost            ~$53 total

Training Loss

The model was trained for one full epoch over the dataset with checkpoints saved every 200 steps. Final training loss: ~2.8 (see training checkpoints for intermediate metrics).

Quick Start

Installation

uv add torch transformers accelerate

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "weights-and-wires/smol-llama"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
  model_name,
  torch_dtype=torch.bfloat16,
  device_map="auto"
)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Remove token_type_ids if present (not used by LLaMA models)
if 'token_type_ids' in inputs:
  del inputs['token_type_ids']

outputs = model.generate(
  **inputs,
  max_new_tokens=100,
  temperature=0.7,
  top_p=0.9,
  do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Advanced Generation

# More controlled generation
outputs = model.generate(
  **inputs,
  max_new_tokens=200,
  temperature=0.8,
  top_k=50,
  top_p=0.95,
  repetition_penalty=1.1,
  do_sample=True,
  pad_token_id=tokenizer.eos_token_id,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Batch Generation

prompts = [
  "Once upon a time",
  "The key to success is",
  "In the year 2050,",
]

# Decoder-only models need left padding for batched generation
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
  tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(
  **inputs,
  max_new_tokens=50,
  temperature=0.7,
  do_sample=True,
  pad_token_id=tokenizer.eos_token_id,
)

for i, output in enumerate(outputs):
  print(f"\nPrompt {i+1}: {prompts[i]}")
  print(f"Generated: {tokenizer.decode(output, skip_special_tokens=True)}")

Loading from Custom Checkpoint Format

If you want to load the original training checkpoints:

import torch
from transformers import PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("weights-and-wires/smol-llama")

# Load custom checkpoint
checkpoint_path = "training_checkpoints/checkpoint_step_5000.pt"
ckpt = torch.load(checkpoint_path, map_location="cuda")

# Create model from scratch (you'll need the model definition)
from utils.model import Llama, ModelArgs
model = Llama(ModelArgs()).cuda().to(torch.bfloat16)

# Handle torch.compile prefix if present
state_dict = {k.replace("_orig_mod.", ""): v for k, v in ckpt['model'].items()}
model.load_state_dict(state_dict)
model.eval()

# Generate
def generate(prompt, max_tokens=50):
  # Greedy decoding, keeping at most the last 2048 tokens as context
  input_ids = tokenizer.encode(prompt, return_tensors="pt").cuda()

  with torch.no_grad():
    for _ in range(max_tokens):
      logits, _ = model(input_ids[:, -2048:])  # custom forward returns (logits, loss)
      next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
      input_ids = torch.cat([input_ids, next_token], dim=1)
      if next_token.item() == tokenizer.eos_token_id:
        break

  return tokenizer.decode(input_ids[0])

print(generate("The meaning of life is"))

Training Checkpoints

Intermediate training checkpoints are available in the training_checkpoints/ folder:

Checkpoint               Steps   Tokens Seen   Loss
-----------------------  ------  ------------  ----
checkpoint_step_200.pt   200     ~200M         -
checkpoint_step_400.pt   400     ~400M         -
...                      ...     ...           -
checkpoint_step_4800.pt  4,800   ~4.8B         -
checkpoint_step_5000.pt  5,000   ~5B           -

These checkpoints include full training state (model, optimizer, step, loss) and can be used to resume training or analyze training dynamics.
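
A hedged sketch of resuming from one of these checkpoints, using the state keys described above (model, optimizer, step, loss); verify the exact layout against utils/checkpoint.py:

import torch
from utils.model import Llama, ModelArgs

ckpt = torch.load("training_checkpoints/checkpoint_step_5000.pt", map_location="cuda")

# Restore weights, stripping the torch.compile prefix if present
model = Llama(ModelArgs()).cuda().to(torch.bfloat16)
model.load_state_dict({k.replace("_orig_mod.", ""): v for k, v in ckpt["model"].items()})

# Restore optimizer state so training continues where it left off
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)
optimizer.load_state_dict(ckpt["optimizer"])

start_step = ckpt["step"] + 1
print(f"resuming at step {start_step}, last recorded loss {ckpt['loss']}")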

Limitations

This is a small model trained on a limited dataset (~6B tokens) for demonstration purposes. As such, it has several limitations:

  • Limited Knowledge: The model has only seen 6B tokens, compared to 100B+ for larger models
  • Generalization: May not perform well on out-of-distribution tasks
  • Factual Accuracy: Should not be relied upon for factual information
  • Biases: Inherits biases present in the web-scraped training data
  • No Instruction Tuning: This is a base model without instruction following or chat capabilities
  • No Safety Alignment: Has not undergone safety training or RLHF

Intended Use

This model is intended for:

  • Research and experimentation with small language models
  • Educational purposes and learning about LLM pre-training
  • Fine-tuning on downstream tasks (see the sketch after these lists)
  • Exploring efficient training techniques
  • Prototyping and proof-of-concept projects

This model is NOT intended for:

  • Production deployments without further fine-tuning
  • Safety-critical applications
  • Generating factual information without verification
  • Applications requiring instruction following (use an instruction-tuned variant)
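
Since this is a plain base model, it drops straight into the standard transformers fine-tuning workflow. A minimal sketch; the dataset and hyperparameters here are placeholders, not recommendations:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model = AutoModelForCausalLM.from_pretrained("weights-and-wires/smol-llama")
tokenizer = AutoTokenizer.from_pretrained("weights-and-wires/smol-llama")
if tokenizer.pad_token is None:
  tokenizer.pad_token = tokenizer.eos_token  # reuse EOS for padding

# Placeholder task data; substitute your own dataset here
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
ds = ds.filter(lambda ex: len(ex["text"]) > 0)
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=ds.column_names)

trainer = Trainer(
  model=model,
  args=TrainingArguments(output_dir="smol-llama-ft", per_device_train_batch_size=8,
                         num_train_epochs=1, bf16=True),
  train_dataset=ds,
  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # labels = inputs
)
trainer.train()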

Training Code

The complete pre-training code is available in the model repository. Key components:

# Clone the repository
git clone https://github.com/weights-and-wires/smol-llama
cd smol-llama

# Install dependencies
uv sync

# Run training (requires GPU)
uv run pretrain.py

See the repository files for complete implementation details including:

  • Custom LLaMA architecture (utils/model.py)
  • Rotary embeddings (utils/rotary.py)
  • Data loading utilities (utils/data.py)
  • Checkpoint management (utils/checkpoint.py)
  • Learning rate scheduling (utils/lr_schedule.py)
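
At its core, pretrain.py is a gradient-accumulation loop. The following is a hedged reconstruction consistent with the hyperparameters above, not the repository's actual code: the data iterator is hypothetical, and the forward signature model(input_ids, targets) -> (logits, loss) is assumed from the checkpoint-loading example:

import torch
from utils.model import Llama, ModelArgs

model = torch.compile(Llama(ModelArgs()).cuda().to(torch.bfloat16))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)
accum_steps = 8  # 64-sequence micro-batches -> 512-sequence effective batch

for step in range(5725):
  # Set the LR from the cosine-with-warmup schedule sketched earlier
  for group in optimizer.param_groups:
    group["lr"] = get_lr(step)
  for _ in range(accum_steps):
    input_ids, targets = next(batches)  # hypothetical iterator of (64, 2048) token batches
    _, loss = model(input_ids, targets)  # assumed forward: (logits, loss)
    (loss / accum_steps).backward()  # average gradients across micro-batches
  torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
  optimizer.step()
  optimizer.zero_grad(set_to_none=True)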

Citation

If you use this model in your research, please cite:

@misc{smol-llama-2026,
  author = {Kashif, Ananya},
  title = {smol-llama: A 360M Parameter LLaMA Model Trained From Scratch},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/weights-and-wires/smol-llama}
}

Also consider citing the FineWeb dataset:

@inproceedings{penedo2024the,
  title = {The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author = {Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle = {The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year = {2024},
  url = {https://openreview.net/forum?id=n6SCkn2QaG}
}

License

This model is released under the MIT License. See the LICENSE file for details.
