# PicoWord-5k

Have you ever dreamed of a model that could fit on your toaster oven? Well, meet PicoWord, the smallest variant in The Word family. PicoWord is a five-thousand-parameter transformer trained on roughly 753,000 words.

## Architecture

| Parameter | Value |
|---|---|
| NUM_HIDDEN_LAYERS | 1 |
| HIDDEN_SIZE | 4 |
| NUM_ATTENTION_HEADS | 1 |
| NUM_KEY_VALUE_HEADS | 1 |
| VOCAB_SIZE | 1200 |
| INTERMEDIATE_SIZE | 16 |
| ROPE_THETA | 1000.0 |
| MAX_POSITION_EMBEDDINGS | 32 |
| TIE_WORD_EMBEDDINGS | True |

> **Note:** No layers. NO. NO LAYERS. Fine... maybe ONE layer.
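
For orientation, here is a hedged sketch of how the table above might map onto a `transformers` config. The Qwen3 architecture is inferred from a comment in the inference script below, and `head_dim=4` is an extra assumption (it is not listed in the table) chosen to keep the attention shapes consistent at this width:

```python
# Hypothetical config sketch; Qwen3 and head_dim=4 are assumptions (see above).
from transformers import Qwen3Config

config = Qwen3Config(
    num_hidden_layers=1,
    hidden_size=4,
    num_attention_heads=1,
    num_key_value_heads=1,
    vocab_size=1200,
    intermediate_size=16,
    rope_theta=1000.0,
    max_position_embeddings=32,
    tie_word_embeddings=True,
    head_dim=4,  # assumption: not part of the table above
)
print(config)
```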

## Training

### Dataset

| Key | Value |
|---|---|
| Entries (words) | 753,232 |
| Tokens | 3,225,398 |
| Characters | 7,022,310 |
| Avg. Tokens per Entry | ~4.2 |
| Avg. Words per Entry | 1 |
| Avg. Chars per Entry | ~9.3 |
| Longest Entry (Tokens) | 36 |
| Shortest Entry (Tokens) | 1 |
| English Words | ~660k |
| Spanish Words | ~90k |
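
As a quick sanity check, the per-entry averages follow directly from the raw counts:

```python
entries, tokens, chars = 753_232, 3_225_398, 7_022_310

print(f"avg tokens/entry: {tokens / entries:.2f}")  # -> 4.28
print(f"avg chars/entry:  {chars / entries:.2f}")   # -> 9.32
```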

### Hardware

PicoWord was trained on one NVIDIA RTX 2060 GPU for 8 epochs with a batch size of 64.

### Training Results

| Step | Epoch | Train Loss | Train PPL | Eval Loss | Eval PPL |
|---|---|---|---|---|---|
| 3000 | 0.78 | 4.9279 | 138.09 | 4.8546 | 128.35 |
| 6000 | 1.56 | 4.5972 | 99.20 | 4.5742 | 97.00 |
| 9000 | 2.34 | 4.4828 | 88.48 | 4.4717 | 87.52 |
| 12000 | 3.12 | 4.4399 | 84.78 | 4.4343 | 84.30 |
| 15000 | 3.90 | 4.4240 | 83.43 | 4.4129 | 82.52 |
| 18000 | 4.68 | 4.4136 | 82.57 | 4.4039 | 81.77 |
| 21000 | 5.45 | 4.3984 | 81.33 | 4.3911 | 80.74 |
| 24000 | 6.23 | 4.3982 | 81.32 | 4.3876 | 80.45 |
| 27000 | 7.01 | 4.3919 | 80.80 | 4.3842 | 80.17 |
| 30000 | 7.79 | 4.3926 | 80.85 | 4.3815 | 79.98 |

Due to its small parameter count, PicoWord plateaued early, which is expected.
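
For reference, perplexity here is simply the exponential of the cross-entropy loss, which you can verify against any row of the table:

```python
import math

# Last row of the table: PPL = exp(loss), up to rounding of the reported loss.
print(math.exp(4.3926))  # -> 80.85 (train PPL)
print(math.exp(4.3815))  # -> 79.96 (eval PPL; table reports 79.98)
```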

## Generations

### Generation 1

Prompt: `wh`

Output: `why`

### Generation 2

Prompt: `a`

Output: `aitee`

### Generation 3

Prompt: `w`

Output: `wiized`

The generations are varied, but as you can tell, most will not be real words. The goal is to reflect the morphology of English and Spanish, not memorized word generation.

## Limitations

1. It does not generate sentences, prose, code, or anything besides a single word-like sequence.
2. It cannot reason or produce complex language.
3. Generated words may or may not be real. The goal isn't real-word generation but reflecting the lexicon and morphology of English and Spanish through tiny language models.
4. Output is non-deterministic. The same prompt can produce very different completions across runs (see the seeding sketch after this list).
5. Some continuations may be empty. Most words are short, so the model often decides the input needs no further continuation.
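
To make runs reproducible despite limitation 4, you can pin the RNG seed. A minimal sketch, assuming the `generate` helper defined in the Inference section below; `set_seed` is the standard `transformers` utility:

```python
from transformers import set_seed

set_seed(42)            # seeds Python, NumPy, and torch RNGs
first = generate("wh")

set_seed(42)            # same seed on the same hardware
second = generate("wh")

assert first == second  # identical completions across the two runs
```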

## Inference

```python
# =============================================================================
# Inference
# =============================================================================

MODEL_DIR      = "harley-ml/PicoWord-5k"   # Hub repo id or local path
TOKENIZER_PATH = "harley-ml/PicoWord-5k"   # Hub repo id, local dir, or tokenizer.json

# --- Generation settings ---
PROMPT             = "w"   # prompt
MAX_NEW_TOKENS     = 32
TEMPERATURE        = 1.2
TOP_P              = 0.95
TOP_K              = 200
REPETITION_PENALTY = 1.1
DO_SAMPLE          = True

# =============================================================================

import torch
from pathlib import Path
from transformers import (
    AutoModelForCausalLM,
    PreTrainedTokenizerFast,
    AddedToken,
)

# ---------------------------------------------------------------------------
# Device
# ---------------------------------------------------------------------------

device = (
    "cuda" if torch.cuda.is_available() else
    "mps"  if torch.backends.mps.is_available() else
    "cpu"
)
print(f"Device : {device}")

# ---------------------------------------------------------------------------
# Tokenizer  (mirrors training setup)
# ---------------------------------------------------------------------------

def load_tokenizer(path: str):
    p = Path(path)
    if p.is_dir():
        p = p / "tokenizer.json"  # local checkout: point at the tokenizer file
    if p.exists():
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p))
    else:
        # Not a local file, so treat `path` as a Hub repo id
        tok = PreTrainedTokenizerFast.from_pretrained(path)
    specials = {}
    if tok.bos_token is None: specials["bos_token"] = AddedToken("<|bos|>", special=True)
    if tok.eos_token is None: specials["eos_token"] = AddedToken("<|eos|>", special=True)
    if tok.unk_token is None: specials["unk_token"] = AddedToken("<|unk|>", special=True)
    if tok.pad_token is None:
        if tok.eos_token is not None:
            tok.pad_token = tok.eos_token
        else:
            specials["pad_token"] = AddedToken("<|pad|>", special=True)
    if specials:
        tok.add_special_tokens(specials)
    tok.padding_side = "left"  # left-pad for batched generation
    return tok

print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f"  Vocab size : {tokenizer.vocab_size}")
print(f"  BOS        : {tokenizer.bos_token!r}")
print(f"  EOS        : {tokenizer.eos_token!r}")
print(f"  PAD        : {tokenizer.pad_token!r}  (id={tokenizer.pad_token_id})")

# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------

print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    low_cpu_mem_usage=True,
)
model.eval()
model.to(device)

total_params = sum(p.numel() for p in model.parameters())
print(f"  Parameters : {total_params:,}")

# ---------------------------------------------------------------------------
# Generation helper
# ---------------------------------------------------------------------------

def generate(
    prompt: str             = PROMPT,
    max_new_tokens: int     = MAX_NEW_TOKENS,
    temperature: float      = TEMPERATURE,
    top_p: float            = TOP_P,
    top_k: int              = TOP_K,
    repetition_penalty: float = REPETITION_PENALTY,
    do_sample: bool         = DO_SAMPLE,
) -> str:
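    """Sample a continuation for `prompt`; returns only the newly generated text."""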
    
    bos         = tokenizer.bos_token or ""
    full_prompt = bos + prompt

    inputs = tokenizer(
        full_prompt,
        return_tensors="pt",
        add_special_tokens=False,
    ).to(device)
    inputs.pop("token_type_ids", None)  # Qwen3 doesn't use this

    gen_kwargs = dict(
        max_new_tokens     = max_new_tokens,
        do_sample          = do_sample,
        repetition_penalty = repetition_penalty,
        eos_token_id       = tokenizer.eos_token_id,
        pad_token_id       = tokenizer.pad_token_id,
    )
    if do_sample:
        gen_kwargs["temperature"] = temperature
        gen_kwargs["top_p"]       = top_p
        gen_kwargs["top_k"]       = top_k

    with torch.inference_mode():
        output_ids = model.generate(**inputs, **gen_kwargs)

    # Strip the prompt tokens so we only return what was generated
    prompt_len = inputs["input_ids"].shape[-1]
    new_ids    = output_ids[0][prompt_len:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)


# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    print(f"\nPrompt : {PROMPT!r}")
    print("-" * 60)

    output = generate(PROMPT)

    print("Generated:")
    print(output)
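
The tokenizer is left-padded above precisely so that prompts can be batched. Here is a hypothetical extension reusing `tokenizer`, `model`, and the settings from the script; the extra prompts are made up for illustration:

```python
prompts = ["w", "a", "es"]  # illustrative prompts
bos = tokenizer.bos_token or ""

batch = tokenizer(
    [bos + p for p in prompts],
    return_tensors="pt",
    padding=True,               # left padding, set in load_tokenizer()
    add_special_tokens=False,
).to(device)
batch.pop("token_type_ids", None)

with torch.inference_mode():
    out = model.generate(
        **batch,
        max_new_tokens=MAX_NEW_TOKENS,
        do_sample=True,
        temperature=TEMPERATURE,
        top_p=TOP_P,
        top_k=TOP_K,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

# With left padding every row shares the same prompt length, so the newly
# generated tokens start at the same offset in each row.
prompt_len = batch["input_ids"].shape[-1]
for p, ids in zip(prompts, out):
    print(f"{p!r} -> {tokenizer.decode(ids[prompt_len:], skip_special_tokens=True)!r}")
```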

## Related Models

1. MicroWord
2. TinyWord
3. TinyWord2
4. MediumWord
5. LargeWord