Instructions to use Felladrin/Minueza-2-96M with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Felladrin/Minueza-2-96M with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Felladrin/Minueza-2-96M")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Felladrin/Minueza-2-96M")
model = AutoModelForCausalLM.from_pretrained("Felladrin/Minueza-2-96M")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Felladrin/Minueza-2-96M with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Felladrin/Minueza-2-96M"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Felladrin/Minueza-2-96M",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
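
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `chat` are hypothetical and not part of vLLM; the response is read following the OpenAI chat completions layout.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="Felladrin/Minueza-2-96M"):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server):
# print(chat("What is the capital of France?"))
```

Passing `base_url="http://localhost:30000"` points the same helper at a server listening on port 30000.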

SGLang

How to use Felladrin/Minueza-2-96M with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Felladrin/Minueza-2-96M" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Felladrin/Minueza-2-96M",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Felladrin/Minueza-2-96M" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Felladrin/Minueza-2-96M",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Felladrin/Minueza-2-96M with Docker Model Runner:
```
docker model run hf.co/Felladrin/Minueza-2-96M
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Minueza-2-96M

Summary

Minueza-2-96M is a compact language model based on the Llama architecture. It was trained from scratch on English and Portuguese datasets with a context length of 4096 tokens, processing 185 billion tokens in total. With only 96 million parameters, it serves as a lightweight foundation that can be fine-tuned for specific applications.

Due to its compact size, the model has significant limitations in reasoning, factual knowledge, and general capabilities compared to larger models. It may generate incorrect, irrelevant, or nonsensical outputs. Furthermore, as it was trained on internet text data, it may harbour biases and potentially produce inappropriate content.

Usage

Install the pinned dependencies:

pip install transformers==4.50.0 torch==2.6.0

Then generate text with a sampling pipeline:

from transformers import pipeline, TextStreamer
import torch

prompt = "This book tells the story"

generate_text = pipeline(
    "text-generation",
    model="Felladrin/Minueza-2-96M",
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

generate_text(
    prompt,
    streamer=TextStreamer(generate_text.tokenizer, skip_special_tokens=True),
    do_sample=True,
    max_new_tokens=512,
    temperature=0.8,
    top_p=0.95,
    top_k=0,
    min_p=0.05,
    repetition_penalty=1.1,
)

Intended Uses

This model was created with the following objectives in mind:

Run on mobile web browsers via Wllama and Transformers.js.
Run fast on machines without GPU.
Serve as a base for fine-tunes using ChatML format.
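
For the fine-tuning objective, the sketch below shows how a conversation is commonly serialised in the ChatML layout. The function name `to_chatml` is hypothetical, and the `<|im_start|>` / `<|im_end|>` markers follow the usual ChatML convention; the exact template a given fine-tune expects is defined by its tokenizer's chat template, so treat this as an illustration rather than the model's own template.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Serialise a list of {"role", "content"} dicts into ChatML text."""
    text = "".join(
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|im_start|>assistant\n"
    return text


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
print(to_chatml(conversation, add_generation_prompt=True))
```

In practice, `tokenizer.apply_chat_template` in Transformers performs this step once a chat template is set on the tokenizer.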

Model Architecture

This is a transformer model with the Llama architecture, trained on a context window of 4096 tokens.

Configuration	Value
max_position_embeddings	4096
hidden_size	672
intermediate_size	2688
num_hidden_layers	8
num_attention_heads	12
num_key_value_heads	4
head_dim	56
attention_dropout	0.1
vocab_size	32000
rope_theta	500000
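
As a sanity check, the parameter count can be estimated from the table. This is a rough sketch that assumes the standard Llama layout (gated MLP with three projection matrices, RMSNorm weights, no bias terms) and untied input and output embeddings; neither assumption is stated in the table.

```python
# Values copied from the configuration table above.
config = {
    "hidden_size": 672,
    "intermediate_size": 2688,
    "num_hidden_layers": 8,
    "num_attention_heads": 12,
    "num_key_value_heads": 4,
    "head_dim": 56,
    "vocab_size": 32000,
}


def count_parameters(cfg, tie_embeddings=False):
    """Estimate the parameter count of a Llama-style decoder (assumed layout)."""
    hidden = cfg["hidden_size"]
    q_dim = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]

    # Query and output projections use q_dim; key and value use the smaller kv_dim.
    attention = hidden * q_dim + 2 * hidden * kv_dim + q_dim * hidden
    # Gate, up, and down projections of the gated MLP.
    mlp = 3 * hidden * cfg["intermediate_size"]
    # Two norm weight vectors per layer.
    norms = 2 * hidden
    per_layer = attention + mlp + norms

    embeddings = cfg["vocab_size"] * hidden
    # Assumption: the output head has its own weights unless tie_embeddings is set.
    lm_head = 0 if tie_embeddings else cfg["vocab_size"] * hidden
    final_norm = hidden
    return cfg["num_hidden_layers"] * per_layer + final_norm + embeddings + lm_head


total = count_parameters(config)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the estimate comes to roughly 96 million parameters, consistent with the model name. With tied embeddings it would be about 74.5 million, which suggests the embeddings are untied.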

Pretraining used the following hyperparameters:

Hyperparameter	Value
learning_rate	0.0003
warmup_steps	2000
weight_decay	0.1
max_grad_norm	2.0
total_train_batch_size	512 (2M tokens per batch)
seed	42
optimizer	Adam with betas=(0.9,0.95) and epsilon=1e-08
lr_scheduler_type	linear
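
To make the schedule concrete, the sketch below computes the tokens per batch and a linear warmup followed by linear decay. The total step count is an assumption derived by dividing the 185 billion training tokens by the batch size in tokens; the actual number of optimizer steps is not given here. The function name `lr_at` is hypothetical.

```python
LEARNING_RATE = 3e-4
WARMUP_STEPS = 2000
SEQUENCES_PER_BATCH = 512
CONTEXT_LENGTH = 4096

# 512 sequences of 4096 tokens each, i.e. about 2M tokens per batch.
TOKENS_PER_BATCH = SEQUENCES_PER_BATCH * CONTEXT_LENGTH

# Assumption: the step count is the token budget divided by tokens per batch.
TOTAL_STEPS = 185_000_000_000 // TOKENS_PER_BATCH


def lr_at(step, total_steps=TOTAL_STEPS):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * (step / WARMUP_STEPS)
    remaining = max(0, total_steps - step)
    return LEARNING_RATE * (remaining / (total_steps - WARMUP_STEPS))


print(f"tokens per batch: {TOKENS_PER_BATCH:,}")
print(f"estimated steps:  {TOTAL_STEPS:,}")
for step in (0, 1000, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>6}: lr = {lr_at(step):.6f}")
```

The shape matches what `get_linear_schedule_with_warmup` in Transformers produces for the `linear` scheduler type.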