nl-lokaal-klein
Small Dutch PII NER model — 46M parameters, distilled from a 110M-parameter teacher for on-device / low-latency redaction.
nl-lokaal-klein ("klein" = small in Dutch) is a token-classification model that identifies personally identifiable information in Dutch text across 14 categories commonly regulated under the GDPR and Dutch privacy law. It is the small / fast member of the LokaalHub Dutch PII family, paired with the larger nl-lokaal-middel teacher.
At a glance
| | |
|---|---|
| Base model | DTAI-KULeuven/robbertje-1-gb-bort |
| Parameters | 45.3M |
| Disk size | 177 MB (fp32) |
| Architecture | RoBERTa, 4 layers, hidden 768, 12 attn heads |
| Max sequence | 512 tokens (trained at 384) |
| Language | Dutch (nl) |
| Task | Token classification (BIO, 47 labels) |
| License | Apache-2.0 |
| Training data | ai4privacy/pii-masking-300k (Dutch subset) + Dutch open-source NER corpora |
Quick start
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "LokaalHub/nl-lokaal-klein"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Mijn naam is Jan van der Berg, BSN 123456782. Bel me op 06-12345678."
for span in ner(text):
    print(f"{span['entity_group']:14} {span['word']!r:30} score={span['score']:.2f}")
```
Expected output (approximate):

```
PERSON         'Jan van der Berg'             score=0.97
BSN            '123456782'                    score=0.99
PHONE          '06-12345678'                  score=0.96
```
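With `aggregation_strategy="simple"`, each returned span carries `start`/`end` character offsets, which is all a redaction pass needs. A minimal sketch of that pass; the spans below are hard-coded to match the expected output above so it runs without downloading the model:

```python
def redact(text, spans, fmt="[{label}]"):
    """Replace each detected span with a placeholder, working right-to-left
    so earlier character offsets stay valid."""
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[:s["start"]] + fmt.format(label=s["entity_group"]) + text[s["end"]:]
    return text

text = "Mijn naam is Jan van der Berg, BSN 123456782. Bel me op 06-12345678."
spans = [  # illustrative: offsets as the pipeline would return them
    {"entity_group": "PERSON", "start": 13, "end": 29},
    {"entity_group": "BSN", "start": 35, "end": 44},
    {"entity_group": "PHONE", "start": 56, "end": 67},
]
print(redact(text, spans))
# Mijn naam is [PERSON], BSN [BSN]. Bel me op [PHONE].
```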
Design choices (read before evaluating)
The model makes three deliberate choices that differ from generic multilingual PII models. Understanding them is essential for interpreting evaluation numbers.
1. 14 merged categories, not 50+ fine-grained labels
Many public PII datasets annotate at fine granularity — separate labels for FIRSTNAME, MIDDLENAME, LASTNAME, PREFIX, SEX, etc. For GDPR redaction and pseudonymization, this granularity is noise: downstream code needs to mask "a name", not decide whose first name vs last name it is.
nl-lokaal-klein merges equivalent sub-types at training time so the model predicts one coherent span per entity instead of several adjacent spans:
| Merged output | Training sources (ai4privacy 300k labels) |
|---|---|
| PERSON | FIRSTNAME, MIDDLENAME, LASTNAME1, LASTNAME2, LASTNAME3, GIVENNAME1, GIVENNAME2, PREFIX, TITLE, SEX |
| ADDRESS | STREET, BUILDINGNUMBER, BUILDING, SECONDARYADDRESS, SECADDRESS, GEOCOORD, NEARBYGPSCOORDINATE |
| CITY | CITY, STATE, COUNTY, COUNTRY |
| POSTAL_CODE | POSTCODE, ZIPCODE |
| IP_ADDRESS | IP, IPV4, IPV6, MAC |
| DATE_OF_BIRTH | BOD, DOB |
| BSN | SOCIALNUMBER, SSN |
| IBAN | IBAN, ACCOUNTNUMBER |
| CREDIT_CARD | CREDITCARDNUMBER |
This improves real-world redaction quality (one clean span over Jan van der Berg instead of three) but costs 5–10 F1 points on strict seqeval when evaluated against datasets that annotate at fine granularity — a prediction of a single ADDRESS over Kerkstraat 15 counts as one prediction, while fine-grained gold counts it as two entities (STREET + BUILDINGNUMBER).
We consider this a feature for the redaction use case and a disclosed limitation for benchmarking.
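To illustrate the merge, a sketch of how fine-grained BIO tags could be collapsed into the coarse scheme at training time. The mapping shown is a partial subset of the table above, and this simplified version also fuses adjacent same-type entities (which is exactly the behavior that produces one clean span per entity):

```python
# Partial mapping only; the full table is in the "Design choices" section.
MERGE = {
    "FIRSTNAME": "PERSON", "LASTNAME1": "PERSON", "GIVENNAME1": "PERSON",
    "STREET": "ADDRESS", "BUILDINGNUMBER": "ADDRESS",
    "POSTCODE": "POSTAL_CODE", "ZIPCODE": "POSTAL_CODE",
    "SOCIALNUMBER": "BSN", "SSN": "BSN",
}

def merge_bio(labels):
    """Map fine-grained BIO tags to the merged scheme, re-deriving B-/I-
    so adjacent sub-spans of the same merged type collapse into one entity."""
    out, prev = [], "O"
    for tag in labels:
        if tag == "O":
            out.append("O"); prev = "O"; continue
        merged = MERGE.get(tag[2:], tag[2:])
        # Continue the previous span if it maps to the same merged type.
        out.append(("I-" if prev == merged else "B-") + merged)
        prev = merged
    return out

print(merge_bio(["B-STREET", "I-STREET", "B-BUILDINGNUMBER", "O"]))
# ['B-ADDRESS', 'I-ADDRESS', 'I-ADDRESS', 'O']
```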
2. Full label set: 23 entity types, 47 BIO labels
Although most evaluation tables below discuss 14 categories (the ones that appear in ai4privacy Dutch gold), the model predicts 23 entity types. The additional nine are useful in Dutch enterprise / legal / healthcare contexts even if underrepresented in the ai4privacy benchmark:
AGE, BTW (VAT), KVK (Chamber of Commerce), LICENSE_PLATE, ORGANIZATION, TECHNOLOGY, URL, DATE, and PASSPORT/DRIVER_LICENSE (separately from the merged gold).
A complete list is in config.json.
3. Distilled from a 110M-parameter Dutch-native teacher
nl-lokaal-klein was trained via knowledge distillation from LokaalHub/nl-lokaal-middel (RobBERT-2023, 110M params). The teacher was fine-tuned on the same data first; its soft-label distributions guided the student. The student retains roughly 92% of the teacher's F1 (0.779 vs 0.844) at ~42% of its parameter count.
Evaluation
All numbers below use seqeval strict (IOB2 scheme) — the most conservative token-classification metric that requires exact entity-boundary matches. Raw model predictions unless labeled pipeline.
Primary result — in-distribution (ai4privacy 300k validation, Dutch)
| Model | Params | F1 | Precision | Recall |
|---|---|---|---|---|
| nl-lokaal-klein (this model) | 46M | 0.7790 | 0.7689 | 0.7895 |
| nl-lokaal-klein + filenthropist rule layer | 46M | 0.7764 | 0.7847 | 0.7683 |
| nl-lokaal-middel (teacher) | 110M | 0.8435 | 0.8070 | 0.8834 |
Evaluated on 7,457 Dutch rows of ai4privacy/pii-masking-300k validation — 47,638 gold entities after 14-category merge. Both models were trained on the 300k train split; the validation split is fully held out.
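To make the strict criterion concrete, here is a minimal pure-Python sketch of strict entity matching (the reported numbers use the seqeval package itself): an entity counts as correct only when the type and both boundaries match exactly, so one merged ADDRESS span over fine-grained STREET + BUILDINGNUMBER gold scores zero.

```python
def entities(tags):
    """Extract (type, start, end) entity spans from IOB2 tags."""
    spans, start = [], None
    for i, t in enumerate(tags + ["O"]):
        if start is not None and (t == "O" or t.startswith("B-") or t[2:] != tags[start][2:]):
            spans.append((tags[start][2:], start, i))
            start = None
        if t.startswith("B-"):
            start = i
    return set(spans)

def strict_f1(gold, pred):
    """F1 over exact (type, start, end) matches, as in seqeval strict mode."""
    g, p = entities(gold), entities(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# One merged prediction vs two fine-grained gold entities: zero credit.
print(strict_f1(["B-STREET", "B-BUILDINGNUMBER"], ["B-ADDRESS", "I-ADDRESS"]))  # 0.0
print(strict_f1(["B-PERSON", "I-PERSON"], ["B-PERSON", "I-PERSON"]))            # 1.0
```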
On post-processing:
nl-lokaal-klein is deployed inside filenthropist with an optional rule layer (regex format-validators for BSN/IBAN/POSTAL_CODE, per-type score thresholds). We measured this layer on the same 300k validation split: F1 0.7764 vs 0.7790 raw — a wash. The rule layer was tuned for a narrower high-precision benchmark and trades recall for precision on the broader distribution, so most deployments should run the model directly unless you need the format-validation backstops for specific regulated types.
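As an example of the kind of format validator such a rule layer can use, a sketch of the standard Dutch elfproef (11-test) for BSN numbers; this is the public checksum rule, not the exact production code:

```python
def valid_bsn(s: str) -> bool:
    """Elfproef (11-test): the weighted digit sum must be divisible by 11,
    with weights 9..2 for the first eight digits and -1 for the last."""
    digits = [int(c) for c in s if c.isdigit()]
    if len(digits) != 9:
        return False
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    return sum(d * w for d, w in zip(digits, weights)) % 11 == 0

print(valid_bsn("123456782"))  # True: the example BSN from the quick start passes
print(valid_bsn("123456789"))  # False: fails the checksum
```

A predicted BSN span that fails the elfproef can be dropped (raising precision) at the cost of missing genuinely malformed but sensitive numbers, which is the recall trade-off described above.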
Related work
The closest comparable open Dutch PII model is OpenMed/OpenMed-PII-Dutch-BioClinicalBERT-Base-110M-v1 (110M params, Apache-2.0), trained on ai4privacy/pii-masking-400k with a 54-label fine-grained scheme. It reports F1 0.8401 on its own 400k held-out benchmark. A direct head-to-head is not scientifically meaningful — different test sets, different label taxonomies (54 fine-grained vs our 14 merged), different boundary conventions — but nl-lokaal-middel reaches a comparable F1 (0.8435) on a comparably-sized Dutch PII held-out set, suggesting parity at this model size.
We did not include OpenMed in the primary table above because its head-to-head score under our merged 14-category scheme reflects label-vocabulary translation artifacts more than model capability.
Per-category breakdown (300k validation, raw model, nl-lokaal-klein)
| Category | Support | Precision | Recall | F1 |
|---|---|---|---|---|
| EMAIL | 2,540 | 0.9466 | 0.9500 | 0.9483 |
| IP_ADDRESS | 2,199 | 0.8490 | 0.9227 | 0.8843 |
| DRIVER_LICENSE | 2,429 | 0.8718 | 0.8592 | 0.8654 |
| PHONE | 1,932 | 0.8012 | 0.8908 | 0.8436 |
| PASSWORD | 1,443 | 0.8032 | 0.8572 | 0.8294 |
| USERNAME | 2,571 | 0.8817 | 0.7651 | 0.8192 |
| DATE_OF_BIRTH | 2,165 | 0.8037 | 0.8226 | 0.8131 |
| BSN | 2,439 | 0.7698 | 0.8130 | 0.7908 |
| DATE | 5,242 | 0.7388 | 0.8373 | 0.7849 |
| PASSPORT | 4,540 | 0.7896 | 0.7267 | 0.7568 |
| CITY | 5,141 | 0.7519 | 0.7440 | 0.7479 |
| PERSON | 8,673 | 0.6974 | 0.7467 | 0.7212 |
| ADDRESS | 4,517 | 0.6663 | 0.7228 | 0.6934 |
| POSTAL_CODE | 1,807 | 0.7689 | 0.6298 | 0.6924 |
| micro avg | 47,638 | 0.7689 | 0.7895 | 0.7790 |
| macro avg | 47,638 | 0.7957 | 0.8063 | 0.7994 |
Strongest categories: structured types with clear formal patterns (EMAIL F1 0.95, IP_ADDRESS 0.88, DRIVER_LICENSE 0.87). Weakest: compound-span types where boundary ambiguity matters most (POSTAL_CODE 0.69, ADDRESS 0.69, PERSON 0.72) — these suffer most under strict seqeval when the 14-category merge collapses sub-spans that gold sometimes splits.
How to reproduce the evaluation
```bash
pip install datasets transformers seqeval
python compare_models.py --dataset 300k
```
Label-mapping tables used by the script are documented in the "Design choices" section above.
Training procedure
Data
| Source | Weight | Role |
|---|---|---|
| ai4privacy/pii-masking-300k (Dutch) | 20% | Real PII spans with character-level annotations |
| Teacher pseudo-labels | 28% | Soft supervision from nl-lokaal-middel on unlabeled Dutch |
| Template-generated synthetic | 17% | Faker nl_NL — fills rare types (IBAN, BSN, KVK) |
| LLM-generated structured HTML/JSON | 15% | Realistic form/record layouts |
| LLM-generated clean prose | 10% | Fluent Dutch passages with injected entities |
| Babelscape/wikineural (nl) | 3% | Dutch Wikipedia NER (PER/ORG/LOC) |
| Babelscape/multinerd (nl) | 3% | Dutch MultiNERD fine-grained NER |
| gretelai/synthetic_pii_finance_multilingual (nl) | 2% | Dutch financial PII |
| careons/dutch-healthcare-pii-ner | 2% | Dutch healthcare PII |
Total: ~40K samples, 25% entity-replacement augmentation → ~50K effective training examples.
Distillation setup
- Teacher: nl-lokaal-middel (110M, RobBERT-2023)
- Loss: cross-entropy on hard labels + KL-divergence to teacher soft probabilities
- Temperature: 3.0, alpha 0.5
- B-tag boundary weight: 2.0× (to improve strict boundary precision)
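The loss above can be sketched with the usual temperature-scaled formulation. The B-tag mask below is illustrative only (which label ids are B- tags depends on the model config), and the shapes are toy values:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5, b_weight=2.0):
    """Hard-label CE (with B-tag up-weighting) + temperature-scaled KL to the
    teacher, combined with weight alpha as described above."""
    num_labels = student_logits.size(-1)
    flat_labels = labels.view(-1)
    # Per-token CE on gold labels, up-weighting boundary (B-) tags.
    ce = F.cross_entropy(student_logits.view(-1, num_labels), flat_labels,
                         reduction="none")
    is_b = (flat_labels % 2 == 1).float()  # illustrative B-tag mask, not the real ids
    ce = (ce * (1.0 + (b_weight - 1.0) * is_b)).mean()
    # KL between temperature-softened distributions, scaled by T^2.
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    return alpha * ce + (1 - alpha) * kl

s = torch.randn(2, 5, 47)            # (batch, seq, num BIO labels)
t = torch.randn(2, 5, 47)
y = torch.randint(0, 47, (2, 5))
loss = distill_loss(s, t, y)
print(loss.item())
```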
Hyperparameter search — our own autoresearch loop
Hyperparameters were not hand-tuned. We used an in-house autoresearch agent that iterates on the config, trains, evaluates on a held-out benchmark, and either keeps or reverts each change — all autonomously. Over 100+ experiments explored learning rate, epochs, sequence length, label smoothing, B-tag weight, data mix ratios, augmentation ratio, and loss variants. Experiment exp-006 (below) was the winner.
The pattern is inspired by Andrej Karpathy's minimal-loop approach to ML research — small, readable code, fast iteration, measured decisions.
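The loop can be sketched as a greedy keep/revert search; the benchmark function here is a toy stand-in for the real train-then-evaluate cycle, and the candidate values are illustrative:

```python
import random

def autoresearch(base_config, candidates, evaluate, seed=0):
    """Greedy keep/revert loop: propose one config change at a time and keep
    it only if the held-out score improves."""
    rng = random.Random(seed)
    config, best = dict(base_config), evaluate(base_config)
    history = []
    for _ in range(len(candidates)):
        key, value = rng.choice(candidates)        # propose one change
        trial = {**config, key: value}
        score = evaluate(trial)                    # stands in for train + benchmark
        kept = score > best
        if kept:
            config, best = trial, score            # keep the change
        history.append((key, value, score, kept))  # otherwise revert implicitly
    return config, best, history

# Toy benchmark whose optimum matches exp-006 (illustrative only).
def toy_eval(cfg):
    return -abs(cfg["lr"] - 3.5e-5) * 1e4 - abs(cfg["epochs"] - 3) * 0.1

cfg, best, hist = autoresearch(
    {"lr": 5e-5, "epochs": 2},
    [("lr", 3.5e-5), ("epochs", 3), ("lr", 1e-4), ("epochs", 5)],
    toy_eval,
)
print(cfg, best)
```

The real loop searched a wider space (learning rate, epochs, sequence length, label smoothing, B-tag weight, data-mix ratios, augmentation ratio, loss variants) over 100+ experiments, but the keep-or-revert decision rule is the same.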
Hyperparameters (exp-006 — winning run)
| | |
|---|---|
| Optimizer | AdamW, weight decay 0.02 |
| Learning rate | 3.5e-5, cosine schedule, 10% warmup |
| Epochs | 3 |
| Batch size | 16 × 2 gradient accumulation = 32 effective |
| Max sequence length | 384 |
| Label smoothing | 0.0 |
| FP16 | Enabled |
| Seed | 42 |
| Hardware | Apple Silicon (MPS) |
Intended use
In scope
- GDPR redaction / pseudonymization of Dutch text (legal documents, emails, customer-service transcripts, medical notes).
- On-device or low-latency PII detection where larger models (110M+) are too slow or too big.
- First-stage filtering before a higher-capacity model or human review.
Out of scope
- High-assurance detection where missing a single PII instance has legal consequences. Use nl-lokaal-middel or an ensemble of multiple models with manual review.
- Languages other than Dutch. The model has not been evaluated on Frisian, Flemish (Belgian Dutch only tangentially), or code-mixed text.
- Anonymization (irreversible removal with re-identification prevention). PII detection is necessary but not sufficient — you also need k-anonymity analysis of retained fields.
- Identifying individual Dutch sub-types (distinguishing first vs last names) — by design, these are merged to PERSON.
Limitations & biases
- Boundary conventions follow ai4privacy 300k. Datasets with different boundaries (e.g., splitting STREET + BUILDINGNUMBER into two separate entities) will score lower under strict evaluation even when the model is qualitatively correct.
- Dutch gazetteer coverage is based on CBS (Centraal Bureau voor de Statistiek) and open sources; uncommon or immigrant-origin Dutch names may see below-average recall.
- Form / structured-data bias. Synthetic training data leans toward form-like text; free-prose performance may be slightly lower.
- Date ambiguity. Dutch date formats (17-04-2026, 17 april 2026, 17/04/26) are all supported, but short "1/4" fragments are intentionally rejected by the production pipeline to reduce false positives.
- No PII categories outside the trained 23 — no detection of medical-record identifiers, custom corporate IDs, biometric descriptors, etc.
Ethical and legal considerations
- This model detects PII. It does not remove, redact, or anonymize it — those operations are the operator's responsibility and must meet the standards of GDPR (Reg. (EU) 2016/679), UAVG, and for AI-system use cases the EU AI Act (Reg. (EU) 2024/1689).
- Do not treat model output as evidence of what is or is not PII for legal purposes. Keep a human reviewer in the loop for legally consequential redactions.
- The model processes text in-memory; no data is sent to external services when run locally. If deployed behind an API, log retention and transit encryption are the operator's responsibility.
Attribution & citation
Base model: RobBERTje-1-gb-bort by DTAI-KULeuven, distilled from RobBERT (MIT license).
Training data: ai4privacy/pii-masking-300k (CC-BY-4.0). Secondary data from Babelscape, Gretel, Careons, and synthetic generation.
Teacher: LokaalHub/nl-lokaal-middel.
If you use nl-lokaal-klein in research or production, please cite:
@misc{nl_lokaal_klein_2026,
title = {nl-lokaal-klein: Efficient Dutch PII NER via Knowledge Distillation},
author = {LokaalHub},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/LokaalHub/nl-lokaal-klein}
}
And please cite the base model and training data:
@inproceedings{delobelle2020robbert,
title = {{RobBERT}: a {Dutch} {RoBERTa}-based Language Model},
author = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
year = {2020}
}
@misc{ai4privacy_2024,
title = {{PII}-Masking-300k: A Multilingual {PII} Detection Dataset},
author = {{ai4privacy}},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/ai4privacy/pii-masking-300k}
}
Changelog
- v1.0 — 2026-04-19 — Initial release. Student checkpoint from distillation experiment exp-006.
Built in the Netherlands — optimized for Dutch privacy law, trained on Dutch data, shipped under Apache-2.0.