# PicoWord-5k

Have you ever dreamed of a model that could fit on your toaster oven? Well, meet PicoWord, the smallest variant in The Word family. PicoWord is a five-thousand-parameter transformer trained on roughly 753,000 words.

## Architecture

| Parameter | Value |
|---|---|
| NUM_HIDDEN_LAYERS | 1 |
| HIDDEN_SIZE | 4 |
| NUM_ATTENTION_HEADS | 1 |
| NUM_KEY_VALUE_HEADS | 1 |
| VOCAB_SIZE | 1200 |
| INTERMEDIATE_SIZE | 16 |
| ROPE_THETA | 1000.0 |
| MAX_POSITION_EMBEDDINGS | 32 |
| TIE_WORD_EMBEDDINGS | True |

> **Note:** No layers. NO. NO LAYERS. Fine... maybe ONE layer.
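
For orientation, here is a hedged sketch of how the table above might map onto a `transformers` config. The Qwen3 architecture is inferred from a comment in the inference script below, and `head_dim=4` is an extra assumption (it is not listed in the table) chosen to keep the attention shapes consistent at this width:

```python
# Hypothetical config sketch; Qwen3 and head_dim=4 are assumptions (see above).
from transformers import Qwen3Config

config = Qwen3Config(
    num_hidden_layers=1,
    hidden_size=4,
    num_attention_heads=1,
    num_key_value_heads=1,
    vocab_size=1200,
    intermediate_size=16,
    rope_theta=1000.0,
    max_position_embeddings=32,
    tie_word_embeddings=True,
    head_dim=4,  # assumption: not part of the table above
)
print(config)
```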

## Training

### Dataset

| Key | Value |
|---|---|
| Entries (words) | 753,232 |
| Tokens | 3,225,398 |
| Characters | 7,022,310 |
| Avg. Tokens per Entry | ~4.2 |
| Avg. Words per Entry | 1 |
| Avg. Chars per Entry | ~9.3 |
| Longest Entry (Tokens) | 36 |
| Shortest Entry (Tokens) | 1 |
| English Words | ~660k |
| Spanish Words | ~90k |
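
As a quick sanity check, the per-entry averages follow directly from the raw counts:

```python
entries, tokens, chars = 753_232, 3_225_398, 7_022_310

print(f"avg tokens/entry: {tokens / entries:.2f}")  # -> 4.28
print(f"avg chars/entry:  {chars / entries:.2f}")   # -> 9.32
```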

### Hardware

PicoWord was trained on one NVIDIA RTX 2060 GPU for 8 epochs with a batch size of 64.

### Training Results

| Step | Epoch | Train Loss | Train PPL | Eval Loss | Eval PPL |
|---|---|---|---|---|---|
| 3000 | 0.78 | 4.9279 | 138.09 | 4.8546 | 128.35 |
| 6000 | 1.56 | 4.5972 | 99.20 | 4.5742 | 97.00 |
| 9000 | 2.34 | 4.4828 | 88.48 | 4.4717 | 87.52 |
| 12000 | 3.12 | 4.4399 | 84.78 | 4.4343 | 84.30 |
| 15000 | 3.90 | 4.4240 | 83.43 | 4.4129 | 82.52 |
| 18000 | 4.68 | 4.4136 | 82.57 | 4.4039 | 81.77 |
| 21000 | 5.45 | 4.3984 | 81.33 | 4.3911 | 80.74 |
| 24000 | 6.23 | 4.3982 | 81.32 | 4.3876 | 80.45 |
| 27000 | 7.01 | 4.3919 | 80.80 | 4.3842 | 80.17 |
| 30000 | 7.79 | 4.3926 | 80.85 | 4.3815 | 79.98 |

Due to its small parameter count, PicoWord plateaued early, which is expected.
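
For reference, perplexity here is simply the exponential of the cross-entropy loss, which you can verify against any row of the table:

```python
import math

# Last row of the table: PPL = exp(loss), up to rounding of the reported loss.
print(math.exp(4.3926))  # -> 80.85 (train PPL)
print(math.exp(4.3815))  # -> 79.96 (eval PPL; table reports 79.98)
```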

## Generations

### Generation 1

Prompt: `wh`

Output: `why`

### Generation 2

Prompt: `a`

Output: `aitee`

### Generation 3

Prompt: `w`

Output: `wiized`

The generations are varied, but as you can tell, most will not be real words. The goal is to reflect the morphology of English and Spanish, not memorized word generation.

## Limitations

1. It does not generate sentences, prose, code, or anything besides a single word-like sequence.
2. It cannot reason or produce complex language.
3. Generated words may or may not be real. The goal isn't real-word generation but reflecting the lexicon and morphology of English and Spanish through tiny language models.
4. Output is non-deterministic. The same prompt can produce very different completions across runs (see the seeding sketch after this list).
5. Some continuations may be empty. Most words are short, so the model often decides the input needs no further continuation.
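
To make runs reproducible despite limitation 4, you can pin the RNG seed. A minimal sketch, assuming the `generate` helper defined in the Inference section below; `set_seed` is the standard `transformers` utility:

```python
from transformers import set_seed

set_seed(42)            # seeds Python, NumPy, and torch RNGs
first = generate("wh")

set_seed(42)            # same seed on the same hardware
second = generate("wh")

assert first == second  # identical completions across the two runs
```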

## Inference

```python
# =============================================================================
# Inference
# =============================================================================

MODEL_DIR      = "harley-ml/PicoWord-5k"   # Hub repo id or local path
TOKENIZER_PATH = "harley-ml/PicoWord-5k"   # Hub repo id, local dir, or tokenizer.json

# --- Generation settings ---
PROMPT             = "w"   # prompt
MAX_NEW_TOKENS     = 32
TEMPERATURE        = 1.2
TOP_P              = 0.95
TOP_K              = 200
REPETITION_PENALTY = 1.1
DO_SAMPLE          = True

# =============================================================================

import torch
from pathlib import Path
from transformers import (
    AutoModelForCausalLM,
    PreTrainedTokenizerFast,
    AddedToken,
)

# ---------------------------------------------------------------------------
# Device
# ---------------------------------------------------------------------------

device = (
    "cuda" if torch.cuda.is_available() else
    "mps"  if torch.backends.mps.is_available() else
    "cpu"
)
print(f"Device : {device}")

# ---------------------------------------------------------------------------
# Tokenizer  (mirrors training setup)
# ---------------------------------------------------------------------------

def load_tokenizer(path: str):
    p = Path(path)
    if p.is_dir():
        p = p / "tokenizer.json"  # local checkout: point at the tokenizer file
    if p.exists():
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p))
    else:
        # Not a local file, so treat `path` as a Hub repo id
        tok = PreTrainedTokenizerFast.from_pretrained(path)
    specials = {}
    if tok.bos_token is None: specials["bos_token"] = AddedToken("<|bos|>", special=True)
    if tok.eos_token is None: specials["eos_token"] = AddedToken("<|eos|>", special=True)
    if tok.unk_token is None: specials["unk_token"] = AddedToken("<|unk|>", special=True)
    if tok.pad_token is None:
        if tok.eos_token is not None:
            tok.pad_token = tok.eos_token
        else:
            specials["pad_token"] = AddedToken("<|pad|>", special=True)
    if specials:
        tok.add_special_tokens(specials)
    tok.padding_side = "left"  # left-pad for batched generation
    return tok

print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f"  Vocab size : {tokenizer.vocab_size}")
print(f"  BOS        : {tokenizer.bos_token!r}")
print(f"  EOS        : {tokenizer.eos_token!r}")
print(f"  PAD        : {tokenizer.pad_token!r}  (id={tokenizer.pad_token_id})")

# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------

print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)
model.eval()
model.to(device)

total_params = sum(p.numel() for p in model.parameters())
print(f"  Parameters : {total_params:,}")

# ---------------------------------------------------------------------------
# Generation helper
# ---------------------------------------------------------------------------

def generate(
    prompt: str             = PROMPT,
    max_new_tokens: int     = MAX_NEW_TOKENS,
    temperature: float      = TEMPERATURE,
    top_p: float            = TOP_P,
    top_k: int              = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool         = DO_SAMPLE,
) -> str:
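    """Sample a continuation for `prompt`; returns only the newly generated text."""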
    
    bos         = tokenizer.bos_token or ""
    full_prompt = bos + prompt

    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)
    inputs.pop("token_type_ids", None)  # Qwen3 doesn't use this

    gen_kwargs = dict(
        max_new_tokens     = max_new_tokens,
        do_sample          = do_sample,
        repetition_penalty = repetition_penalty,
        eos_token_id       = tokenizer.eos_token_id,
        pad_token_id       = tokenizer.pad_token_id,
    )
    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"]       = top_p
        gen_kwargs["top_k"]       = top_k

    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)

    # Strip the prompt tokens so we only return what was generated
    prompt_len = inputs["input_ids"].shape[-1]
    new_ids    = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)


# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)

    output = generate(PROMPT)

    print("Generated:")
    print(output)
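
The tokenizer is left-padded above precisely so that prompts can be batched. Here is a hypothetical extension reusing `tokenizer`, `model`, and the settings from the script; the extra prompts are made up for illustration:

```python
prompts = ["w", "a", "es"]  # illustrative prompts
bos = tokenizer.bos_token or ""

batch = tokenizer(
    [bos + p for p in prompts],
    return_tensors="pt",
    padding=True,               # left padding, set in load_tokenizer()
    add_special_tokens=False,
).to(device)
batch.pop("token_type_ids", None)

with torch.inference_mode():
    out = model.generate(
        **batch,
        max_new_tokens=MAX_NEW_TOKENS,
        do_sample=True,
        temperature=TEMPERATURE,
        top_p=TOP_P,
        top_k=TOP_K,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

# With left padding every row shares the same prompt length, so the newly
# generated tokens start at the same offset in each row.
prompt_len = batch["input_ids"].shape[-1]
for p, ids in zip(prompts, out):
    print(f"{p!r} -> {tokenizer.decode(ids[prompt_len:], skip_special_tokens=True)!r}")
```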

## Related Models

1. MicroWord
2. TinyWord
3. TinyWord2
4. MediumWord
5. LargeWord