smol-llama 🦙 (360M) — from-scratch LLaMA-style pretraining

This repository contains smol-llama, a ~360M parameter LLaMA-architecture causal LM trained from scratch for next-token prediction.
This project was primarily an educational + engineering effort to reproduce a SmolLM-like training setup at a smaller scale.

TL;DR: Wanted to see if it's possible to actually pretrain an LLM from scratch end-to-end — data pipeline → tokenizer → training → checkpoint → Hub.

Model Description

smol-llama is a compact implementation of the LLaMA architecture, featuring modern techniques like Grouped Query Attention (GQA), RoPE embeddings, and SwiGLU activations. It was trained on the weights-and-wires/fineweb-6b dataset.

NOTE: This is an early checkpoint / research-y base model. Expect imperfect generations.


Model Architecture

Component         Value
----------------  -------------------------
Parameters        360M
Hidden Dimension  960
Layers            32
Attention Heads   15 (Query) / 5 (KV)
Head Dimension    64
Context Length    2048
Vocabulary Size   49,152
Architecture      LLaMA-style decoder-only
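
For concreteness, the table above maps onto a small config object along these lines (a hedged sketch; the field names are illustrative guesses, and utils/model.py is authoritative):

from dataclasses import dataclass

@dataclass
class ModelArgs:
  # Field names are illustrative guesses; utils/model.py is authoritative
  dim: int = 960           # hidden dimension (15 heads x 64 head dim)
  n_layers: int = 32
  n_heads: int = 15        # query heads
  n_kv_heads: int = 5      # GQA: 15 / 5 = 3 query heads per KV head
  vocab_size: int = 49152
  max_seq_len: int = 2048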

Key Features:

  • Grouped Query Attention (GQA): 3:1 query-to-KV head ratio for efficient inference (see the sketch after this list)
  • RoPE: Rotary Position Embeddings for better length generalization
  • RMSNorm: Root Mean Square Layer Normalization
  • SwiGLU: Gated linear unit activation in FFN
  • Flash Attention 2: Memory-efficient attention computation
  • Gradient Checkpointing: trades recomputation for activation memory, enabling larger batch sizes
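
To make the GQA ratio concrete, here is a minimal sketch of the standard trick of repeating each KV head so the 5 KV heads line up with the 15 query heads before attention (illustrative only, not the repository's exact code):

import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
  # (batch, seq_len, n_kv_heads, head_dim) -> (batch, seq_len, n_kv_heads * n_rep, head_dim)
  if n_rep == 1:
    return x
  bsz, seqlen, n_kv_heads, head_dim = x.shape
  x = x[:, :, :, None, :].expand(bsz, seqlen, n_kv_heads, n_rep, head_dim)
  return x.reshape(bsz, seqlen, n_kv_heads * n_rep, head_dim)

# Each of the 5 KV heads is shared by 3 of the 15 query heads
keys = torch.randn(1, 2048, 5, 64)
print(repeat_kv(keys, n_rep=3).shape)  # torch.Size([1, 2048, 15, 64])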

Training Details

Dataset

Trained on weights-and-wires/fineweb-6b, a curated subset of the FineWeb dataset containing ~6 billion high-quality web tokens.
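
To poke at the training data without downloading all of it, the dataset can be streamed from the Hub. A small sketch, assuming the split and column names follow FineWeb's layout:

from datasets import load_dataset

# Stream a few documents rather than downloading the full dump
ds = load_dataset("weights-and-wires/fineweb-6b", split="train", streaming=True)
for doc in ds.take(3):
  print(doc["text"][:200])  # the "text" column name is assumed, as in FineWeb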

Training Hyperparameters

Hyperparameter         Value
---------------------  -------------------------
Optimizer              AdamW (fused)
Learning Rate          3e-4 (peak)
LR Schedule            Cosine with linear warmup
Warmup Steps           900
Total Steps            5,725 (~1 epoch)
Batch Size             64 sequences
Gradient Accumulation  8 steps
Effective Batch Size   512 sequences
Context Length         2048 tokens
Tokens per Step        ~1M
Total Tokens           ~6B
Precision              bfloat16
Gradient Clipping      1.0 (max norm)
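
The per-step token count follows directly from the table: 64 sequences × 8 accumulation steps × 2048 tokens = 1,048,576 ≈ 1M tokens, and 5,725 steps × ~1M tokens ≈ 6B. The schedule can be sketched as follows (a minimal illustration with an assumed floor of 0; utils/lr_schedule.py is the reference):

import math

def get_lr(step, peak_lr=3e-4, warmup_steps=900, total_steps=5725, min_lr=0.0):
  # Linear warmup from 0 to the peak over the first 900 steps
  if step < warmup_steps:
    return peak_lr * (step + 1) / warmup_steps
  # Cosine decay from the peak down to min_lr over the remaining steps
  progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
  return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))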

Infrastructure

Resource        Specification
--------------  --------------------------
GPU             1× NVIDIA H100 (80GB PCIe)
Training Time   ~22 hours
Throughput      ~75,000 tokens/sec
Cloud Provider  RunPod
Cost            ~$53 total

Training Loss

The model was trained for one full epoch over the dataset with checkpoints saved every 200 steps. Final training loss: ~2.8 (see training checkpoints for intermediate metrics).

Quick Start

Installation

uv add torch transformers accelerate

Basic Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "weights-and-wires/smol-llama"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
  model_name,
  torch_dtype=torch.bfloat16,
  device_map="auto"
)

# Generate text
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Remove token_type_ids if present (not used by LLaMA models)
if 'token_type_ids' in inputs:
  del inputs['token_type_ids']

outputs = model.generate(
  **inputs,
  max_new_tokens=100,
  temperature=0.7,
  top_p=0.9,
  do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Advanced Generation

# More controlled generation
outputs = model.generate(
  **inputs,
  max_new_tokens=200,
  temperature=0.8,
  top_k=50,
  top_p=0.95,
  repetition_penalty=1.1,
  do_sample=True,
  pad_token_id=tokenizer.eos_token_id,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Batch Generation

prompts = [
  "Once upon a time",
  "The key to success is",
  "In the year 2050,",
]

# Decoder-only models need left padding for batched generation
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
  tokenizer.pad_token = tokenizer.eos_token

inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(
  **inputs,
  max_new_tokens=50,
  temperature=0.7,
  do_sample=True,
  pad_token_id=tokenizer.eos_token_id,
)

for i, output in enumerate(outputs):
  print(f"\nPrompt {i+1}: {prompts[i]}")
  print(f"Generated: {tokenizer.decode(output, skip_special_tokens=True)}")

Loading from Custom Checkpoint Format

If you want to load the original training checkpoints:

import torch
from transformers import PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("weights-and-wires/smol-llama")

# Load custom checkpoint
checkpoint_path = "training_checkpoints/checkpoint_step_5000.pt"
ckpt = torch.load(checkpoint_path, map_location="cuda")

# Create model from scratch (you'll need the model definition)
from utils.model import Llama, ModelArgs
model = Llama(ModelArgs()).cuda().to(torch.bfloat16)

# Handle torch.compile prefix if present
state_dict = {k.replace("_orig_mod.", ""): v for k, v in ckpt['model'].items()}
model.load_state_dict(state_dict)
model.eval()

# Generate
def generate(prompt, max_tokens=50):
  # Greedy decoding, keeping at most the last 2048 tokens as context
  input_ids = tokenizer.encode(prompt, return_tensors="pt").cuda()

  with torch.no_grad():
    for _ in range(max_tokens):
      logits, _ = model(input_ids[:, -2048:])  # custom forward returns (logits, loss)
      next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
      input_ids = torch.cat([input_ids, next_token], dim=1)
      if next_token.item() == tokenizer.eos_token_id:
        break

  return tokenizer.decode(input_ids[0])

print(generate("The meaning of life is"))

Training Checkpoints

Intermediate training checkpoints are available in the training_checkpoints/ folder:

Checkpoint               Steps   Tokens Seen   Loss
-----------------------  ------  ------------  ----
checkpoint_step_200.pt   200     ~200M         -
checkpoint_step_400.pt   400     ~400M         -
...                      ...     ...           -
checkpoint_step_4800.pt  4,800   ~4.8B         -
checkpoint_step_5000.pt  5,000   ~5B           -

These checkpoints include full training state (model, optimizer, step, loss) and can be used to resume training or analyze training dynamics.
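
A hedged sketch of resuming from one of these checkpoints, using the state keys described above (model, optimizer, step, loss); verify the exact layout against utils/checkpoint.py:

import torch
from utils.model import Llama, ModelArgs

ckpt = torch.load("training_checkpoints/checkpoint_step_5000.pt", map_location="cuda")

# Restore weights, stripping the torch.compile prefix if present
model = Llama(ModelArgs()).cuda().to(torch.bfloat16)
model.load_state_dict({k.replace("_orig_mod.", ""): v for k, v in ckpt["model"].items()})

# Restore optimizer state so training continues where it left off
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)
optimizer.load_state_dict(ckpt["optimizer"])

start_step = ckpt["step"] + 1
print(f"resuming at step {start_step}, last recorded loss {ckpt['loss']}")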

Limitations

This is a small model trained on a limited dataset (~6B tokens) for demonstration purposes. As such, it has several limitations:

  • Limited Knowledge: The model has only seen 6B tokens, compared to 100B+ for larger models
  • Generalization: May not perform well on out-of-distribution tasks
  • Factual Accuracy: Should not be relied upon for factual information
  • Biases: Inherits biases present in the web-scraped training data
  • No Instruction Tuning: This is a base model without instruction following or chat capabilities
  • No Safety Alignment: Has not undergone safety training or RLHF

Intended Use

This model is intended for:

  • Research and experimentation with small language models
  • Educational purposes and learning about LLM pre-training
  • Fine-tuning on downstream tasks (see the sketch after these lists)
  • Exploring efficient training techniques
  • Prototyping and proof-of-concept projects

This model is NOT intended for:

  • Production deployments without further fine-tuning
  • Safety-critical applications
  • Generating factual information without verification
  • Applications requiring instruction following (use an instruction-tuned variant)
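
Since this is a plain base model, it drops straight into the standard transformers fine-tuning workflow. A minimal sketch; the dataset and hyperparameters here are placeholders, not recommendations:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model = AutoModelForCausalLM.from_pretrained("weights-and-wires/smol-llama")
tokenizer = AutoTokenizer.from_pretrained("weights-and-wires/smol-llama")
if tokenizer.pad_token is None:
  tokenizer.pad_token = tokenizer.eos_token  # reuse EOS for padding

# Placeholder task data; substitute your own dataset here
ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
ds = ds.filter(lambda ex: len(ex["text"]) > 0)
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
            batched=True, remove_columns=ds.column_names)

trainer = Trainer(
  model=model,
  args=TrainingArguments(output_dir="smol-llama-ft", per_device_train_batch_size=8,
                         num_train_epochs=1, bf16=True),
  train_dataset=ds,
  data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # labels = inputs
)
trainer.train()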

Training Code

The complete pre-training code is available in the model repository. Key components:

# Clone the repository
git clone https://github.com/weights-and-wires/smol-llama
cd smol-llama

# Install dependencies
uv sync

# Run training (requires GPU)
uv run pretrain.py

See the repository files for complete implementation details including:

  • Custom LLaMA architecture (utils/model.py)
  • Rotary embeddings (utils/rotary.py)
  • Data loading utilities (utils/data.py)
  • Checkpoint management (utils/checkpoint.py)
  • Learning rate scheduling (utils/lr_schedule.py)
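
At its core, pretrain.py is a gradient-accumulation loop. The following is a hedged reconstruction consistent with the hyperparameters above, not the repository's actual code: the data iterator is hypothetical, and the forward signature model(input_ids, targets) -> (logits, loss) is assumed from the checkpoint-loading example:

import torch
from utils.model import Llama, ModelArgs

model = torch.compile(Llama(ModelArgs()).cuda().to(torch.bfloat16))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)
accum_steps = 8  # 64-sequence micro-batches -> 512-sequence effective batch

for step in range(5725):
  # Set the LR from the cosine-with-warmup schedule sketched earlier
  for group in optimizer.param_groups:
    group["lr"] = get_lr(step)
  for _ in range(accum_steps):
    input_ids, targets = next(batches)  # hypothetical iterator of (64, 2048) token batches
    _, loss = model(input_ids, targets)  # assumed forward: (logits, loss)
    (loss / accum_steps).backward()  # average gradients across micro-batches
  torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
  optimizer.step()
  optimizer.zero_grad(set_to_none=True)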

Citation

If you use this model in your research, please cite:

@misc{smol-llama-2026,
  author = {Kashif, Ananya},
  title = {smol-llama: A 360M Parameter LLaMA Model Trained From Scratch},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/weights-and-wires/smol-llama}
}

Also consider citing the FineWeb dataset:

@inproceedings{penedo2024the,
  title = {The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author = {Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle = {The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year = {2024},
  url = {https://openreview.net/forum?id=n6SCkn2QaG}
}

License

This model is released under the MIT License. See the LICENSE file for details.
