WordGen is a collection of six models that scale from five thousand to five hundred fifty-nine thousand parameters.
Have you ever dreamed of a model that could fit on your toaster oven? Well, meet PicoWord, the smallest variant in the WordGen family. PicoWord is a five-thousand-parameter transformer trained on roughly seven hundred fifty-three thousand words. Its architecture configuration:
| Parameter | Value |
|---|---|
| NUM_HIDDEN_LAYERS | 1 |
| HIDDEN_SIZE | 4 |
| NUM_ATTENTION_HEADS | 1 |
| NUM_KEY_VALUE_HEADS | 1 |
| VOCAB_SIZE | 1200 |
| INTERMEDIATE_SIZE | 16 |
| ROPE_THETA | 1000.0 |
| MAX_POSITION_EMBEDDINGS | 32 |
| TIE_WORD_EMBEDDINGS | True |
Note: No layers. NO. NO LAYERS. Fine... maybe ONE layer.
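For context, here is a minimal sketch of what a model this size looks like when instantiated in code. It assumes the Qwen3 architecture (the inference script below treats the checkpoint as Qwen3) and a head dimension of 4, which the table does not list; both are assumptions, not confirmed values.

```python
# Hypothetical sketch: an untrained PicoWord-sized model built from the
# configuration table above. Qwen3 and head_dim=4 are assumptions.
from transformers import Qwen3Config, Qwen3ForCausalLM

config = Qwen3Config(
    num_hidden_layers=1,
    hidden_size=4,
    num_attention_heads=1,
    num_key_value_heads=1,
    vocab_size=1200,
    intermediate_size=16,
    rope_theta=1000.0,
    max_position_embeddings=32,
    tie_word_embeddings=True,
    head_dim=4,  # assumed; Qwen3's default of 128 would inflate the count
)
model = Qwen3ForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()):,}")  # ≈ 5,100
```

Under these assumptions, nearly the whole parameter budget is the tied 1200 × 4 embedding matrix (4,800 of roughly 5,100 parameters).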
Dataset statistics:

| Statistic | Value |
|---|---|
| Entries (words) | 753,232 |
| Tokens | 3,225,398 |
| Characters | 7,022,310 |
| Avg. Tokens Per Entry | ~4.2 |
| Avg. Words Per Entry | 1 |
| Avg. Chars Per Entry | ~9.3 |
| Longest Entry (Tokens) | 36 |
| Shortest Entry (Tokens) | 1 |
| English Words | ~660k |
| Spanish Words | ~90k |
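As a quick sanity check, the per-entry averages follow directly from the raw totals:

```python
# Deriving the per-entry averages from the dataset totals above.
entries, tokens, chars = 753_232, 3_225_398, 7_022_310
print(tokens / entries)  # ≈ 4.28 tokens per entry (reported as ~4.2)
print(chars / entries)   # ≈ 9.32 characters per entry (reported as ~9.3)
```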
PicoWord was trained on one NVIDIA RTX 2060 GPU for 8 epochs with a batch size of 64.
| Step | Epoch | Train Loss | Train PPL | Val Loss | Val PPL |
|---|---|---|---|---|---|
| 3000 | 0.78 | 4.9279 | 138.09 | 4.8546 | 128.35 |
| 6000 | 1.56 | 4.5972 | 99.20 | 4.5742 | 97.00 |
| 9000 | 2.34 | 4.4828 | 88.48 | 4.4717 | 87.52 |
| 12000 | 3.12 | 4.4399 | 84.78 | 4.4343 | 84.30 |
| 15000 | 3.90 | 4.4240 | 83.43 | 4.4129 | 82.52 |
| 18000 | 4.68 | 4.4136 | 82.57 | 4.4039 | 81.77 |
| 21000 | 5.45 | 4.3984 | 81.33 | 4.3911 | 80.74 |
| 24000 | 6.23 | 4.3982 | 81.32 | 4.3876 | 80.45 |
| 27000 | 7.01 | 4.3919 | 80.80 | 4.3842 | 80.17 |
| 30000 | 7.79 | 4.3926 | 80.85 | 4.3815 | 79.98 |
Due to its small parameter count, PicoWord plateaued early, which is expected.
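For reference, the perplexity columns are just the exponential of the corresponding loss, which is easy to verify against the final row:

```python
import math

# PPL = exp(cross-entropy loss); final-step losses from the table above.
print(math.exp(4.3926))  # ≈ 80.85, the reported train PPL
print(math.exp(4.3815))  # ≈ 79.96, matching the reported 79.98 val PPL up to rounding
```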
Sample generations:

| Prompt | Output |
|---|---|
| wh | why |
| a | aitee |
| w | wiized |
The generations are varied, but as you can tell, most will not be real words. The goal is to capture the morphology of English and Spanish, not to memorize and reproduce words from the training set. A full inference script follows.
```python
# =============================================================================
# Inference
# =============================================================================
MODEL_DIR = "harley-ml/PicoWord-5k"      # local path or Hub repo id
TOKENIZER_PATH = "harley-ml/PicoWord-5k" # tokenizer.json path or Hub repo id
# --- Generation settings ---
PROMPT = "w"              # seed text; the model completes it into a "word"
MAX_NEW_TOKENS = 32
TEMPERATURE = 1.2         # >1 flattens the distribution for more variety
TOP_P = 0.95              # nucleus sampling threshold
TOP_K = 200               # sample only from the 200 most likely tokens
REPETITION_PENALTY = 1.1  # >1 discourages repeated tokens
DO_SAMPLE = True          # set False for greedy decoding
# =============================================================================
import torch
from pathlib import Path
from transformers import (
AutoModelForCausalLM,
PreTrainedTokenizerFast,
AddedToken,
)
# ---------------------------------------------------------------------------
# Device
# ---------------------------------------------------------------------------
device = (
"cuda" if torch.cuda.is_available() else
"mps" if torch.backends.mps.is_available() else
"cpu"
)
print(f"Device : {device}")
# ---------------------------------------------------------------------------
# Tokenizer (mirrors training setup)
# ---------------------------------------------------------------------------
def load_tokenizer(path: str):
    p = Path(path).resolve()
    if p.is_file():
        # Local tokenizer.json file (mirrors the training setup)
        tok = PreTrainedTokenizerFast(tokenizer_file=str(p))
    else:
        # Hub repo id or local model directory
        tok = PreTrainedTokenizerFast.from_pretrained(path)
specials = {}
if tok.bos_token is None: specials["bos_token"] = AddedToken("<|bos|>", special=True)
if tok.eos_token is None: specials["eos_token"] = AddedToken("<|eos|>", special=True)
if tok.unk_token is None: specials["unk_token"] = AddedToken("<|unk|>", special=True)
if tok.pad_token is None:
if tok.eos_token is not None:
tok.pad_token = tok.eos_token
else:
specials["pad_token"] = AddedToken("<|pad|>", special=True)
if specials:
tok.add_special_tokens(specials)
tok.padding_side = "left" # left-pad for batched generation
return tok
print("Loading tokenizer...")
tokenizer = load_tokenizer(TOKENIZER_PATH)
print(f" Vocab size : {tokenizer.vocab_size}")
print(f" BOS : {tokenizer.bos_token!r}")
print(f" EOS : {tokenizer.eos_token!r}")
print(f" PAD : {tokenizer.pad_token!r} (id={tokenizer.pad_token_id})")
# ---------------------------------------------------------------------------
# Model
# ---------------------------------------------------------------------------
print(f"\nLoading model from {MODEL_DIR} ...")
model = AutoModelForCausalLM.from_pretrained(
MODEL_DIR,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
low_cpu_mem_usage=True,
)
model.eval()
model.to(device)
total_params = sum(p.numel() for p in model.parameters())
print(f" Parameters : {total_params:,}")
# ---------------------------------------------------------------------------
# Generation helper
# ---------------------------------------------------------------------------
def generate(
prompt: str = PROMPT,
max_new_tokens: int = MAX_NEW_TOKENS,
temperature: float = TEMPERATURE,
top_p: float = TOP_P,
top_k: int = TOP_K,
repetition_penalty: float = REPETITION_PENALTY,
do_sample: bool = DO_SAMPLE,
) -> str:
bos = tokenizer.bos_token or ""
full_prompt = bos + prompt
inputs = tokenizer(
full_prompt,
return_tensors="pt",
add_special_tokens=False,
).to(device)
inputs.pop("token_type_ids", None) # Qwen3 doesn't use this
gen_kwargs = dict(
max_new_tokens = max_new_tokens,
do_sample = do_sample,
repetition_penalty = repetition_penalty,
eos_token_id = tokenizer.eos_token_id,
pad_token_id = tokenizer.pad_token_id,
)
if do_sample:
gen_kwargs["temperature"] = temperature
gen_kwargs["top_p"] = top_p
gen_kwargs["top_k"] = top_k
with torch.inference_mode():
output_ids = model.generate(**inputs, **gen_kwargs)
# Strip the prompt tokens so we only return what was generated
prompt_len = inputs["input_ids"].shape[-1]
new_ids = output_ids[0][prompt_len:]
return tokenizer.decode(new_ids, skip_special_tokens=True)
# ---------------------------------------------------------------------------
# Run
# ---------------------------------------------------------------------------
if __name__ == "__main__":
print(f"\nPrompt : {PROMPT!r}")
print("-" * 60)
output = generate(PROMPT)
print("Generated:")
print(output)
```
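Because the tokenizer is left-padded, batched generation also works. Here is a hypothetical follow-up that reuses the model, tokenizer, and settings defined in the script above:

```python
# Hypothetical batched generation, run after the script above. Left padding
# right-aligns the prompts so generation continues from each last real token.
prompts = ["wh", "a", "w"]
bos = tokenizer.bos_token or ""
batch = tokenizer(
    [bos + p for p in prompts],
    return_tensors="pt",
    padding=True,
    add_special_tokens=False,
).to(device)
batch.pop("token_type_ids", None)  # Qwen3 doesn't use this

with torch.inference_mode():
    out = model.generate(
        **batch,
        max_new_tokens=MAX_NEW_TOKENS,
        do_sample=True,
        temperature=TEMPERATURE,
        top_p=TOP_P,
        top_k=TOP_K,
        repetition_penalty=REPETITION_PENALTY,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

prompt_len = batch["input_ids"].shape[-1]
for prompt, ids in zip(prompts, out):
    completion = tokenizer.decode(ids[prompt_len:], skip_special_tokens=True)
    print(f"{prompt!r} -> {completion!r}")
```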