Grow Up and Merge: Scaling Strategies for Efficient Language Adaptation
About the Model
This model was developed for the paper Grow Up and Merge: Scaling Strategies for Efficient Language Adaptation.
It is based on the SmolLM2 architecture, but instead of the original English tokenizer, it uses the multilingual Llama 3.3 tokenizer with a 128K vocabulary. This results in a 180M-parameter base model.
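As a quick sanity check of the tokenizer swap, the vocabulary size can be inspected directly (the model id here is the same placeholder as in the usage example below):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("liu-nlp/model-id")
print(len(tokenizer))  # roughly 128K entries for the Llama 3.3 tokenizer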
The base models were trained as follows: there are two English base models, a 180M model and a 572M model. The 572M model was obtained by HyperCloning the 180M model after 80% of the training steps and training it on the remaining 20% of the data.
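For intuition, HyperCloning grows a model by expanding its weight matrices in a function-preserving way. Below is a minimal sketch of that idea for a single linear layer, assuming a plain 2× width expansion; the exact growth recipe for attention, embeddings, and normalization layers follows the paper, not this snippet.

import torch

def hyperclone_linear(weight: torch.Tensor, factor: int = 2) -> torch.Tensor:
    # Tile the small weight matrix `factor` times along both dimensions and
    # rescale so that a duplicated input yields a duplicated (identical) output.
    return weight.repeat(factor, factor) / factor

# Toy check: the grown layer reproduces the small layer's outputs.
w = torch.randn(4, 3)
x = torch.randn(3)
w_big = hyperclone_linear(w)
assert torch.allclose(w_big @ x.repeat(2), (w @ x).repeat(2), atol=1e-5)

Because the expanded layer computes the same function as the original one on duplicated activations, the grown model starts continued pretraining from the small model's behaviour rather than from a random initialization.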
Models were adapted to each target language using the corresponding FineWeb-2 subset. The multilingual models were adapted in the same way, but on the FineWeb-2 subsets for all target languages at once.
We evaluate three adaptation setups:
- 1× — Continue pretraining the 180M base model on the target language
- 1× (cloned) — HyperClone the 180M base model to 572M, then continue pretraining on the target language
- 2× — Continue pretraining the 572M base model on the target language
For more details on our setups, please refer to our paper.
Usage
The model can be loaded as shown below. Note that this is a base model and is not instruction-tuned.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "liu-nlp/model-id"
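# placeholder repository id; substitute the concrete model you want to load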
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
prompt = "To be hypercloned feels"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Limitations
Users should apply the model responsibly, particularly in high-stakes or user-facing applications. The model is not guaranteed to provide factual, neutral, or safe outputs. Additional safety filtering, human oversight, or RLHF-style alignment may be required depending on the deployment context. Models of such small sizes are expected to produce hallucinations, especially for topics underrepresented or inconsistently represented in the training data.
While the base model’s pre-training corpus presumably excludes overtly unethical or harmful material due to its educational focus, the continued pre-training phase introduces additional considerations. Large-scale corpora for our target languages are comparatively limited in availability and curation quality. As a result, the datasets used for domain or language adaptation may contain inappropriate, biased, or otherwise undesirable content.
Citation
If you use this model, please cite:
@misc{glocker2025growmergescalingstrategies,
  title={Grow Up and Merge: Scaling Strategies for Efficient Language Adaptation},
  author={Kevin Glocker and Kätriin Kukk and Romina Oji and Marcel Bollmann and Marco Kuhlmann and Jenny Kunz},
  year={2025},
  eprint={2512.10772},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.10772},
}