nl-lokaal-klein
Small Dutch PII NER model — 46M parameters, distilled from a 110M-parameter teacher for on-device / low-latency redaction.
nl-lokaal-klein ("klein" = small in Dutch) is a token-classification model that identifies personally identifiable information in Dutch text across 14 categories commonly regulated under the GDPR and Dutch privacy law. It is the small / fast member of the LokaalHub Dutch PII family, paired with the larger nl-lokaal-middel teacher.
At a glance
| | |
|---|---|
| Base model | DTAI-KULeuven/robbertje-1-gb-bort |
| Parameters | 45.3M |
| Disk size | 177 MB (fp32) |
| Architecture | RoBERTa, 4 layers, hidden 768, 12 attn heads |
| Max sequence | 512 tokens (trained at 384) |
| Language | Dutch (nl) |
| Task | Token classification (BIO, 47 labels) |
| License | Apache-2.0 |
| Training data | ai4privacy/pii-masking-300k (Dutch subset) + Dutch open-source NER corpora |
Quick start
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "LokaalHub/nl-lokaal-klein"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Mijn naam is Jan van der Berg, BSN 123456782. Bel me op 06-12345678."
for span in ner(text):
    print(f"{span['entity_group']:14} {span['word']!r:30} score={span['score']:.2f}")
```
Expected output (approximate):

```
PERSON         'Jan van der Berg'             score=0.97
BSN            '123456782'                    score=0.99
PHONE          '06-12345678'                  score=0.96
```
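With `aggregation_strategy="simple"`, each returned span carries `start`/`end` character offsets, which is all a redaction pass needs. A minimal sketch of that pass; the spans below are hard-coded to match the expected output above so it runs without downloading the model:

```python
def redact(text, spans, fmt="[{label}]"):
    """Replace each detected span with a placeholder, working right-to-left
    so earlier character offsets stay valid."""
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[:s["start"]] + fmt.format(label=s["entity_group"]) + text[s["end"]:]
    return text

text = "Mijn naam is Jan van der Berg, BSN 123456782. Bel me op 06-12345678."
spans = [  # illustrative: offsets as the pipeline would return them
    {"entity_group": "PERSON", "start": 13, "end": 29},
    {"entity_group": "BSN", "start": 35, "end": 44},
    {"entity_group": "PHONE", "start": 56, "end": 67},
]
print(redact(text, spans))
# Mijn naam is [PERSON], BSN [BSN]. Bel me op [PHONE].
```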
Design choices (read before evaluating)
The model makes three deliberate choices that differ from generic multilingual PII models. Understanding them is essential for interpreting evaluation numbers.
1. 14 merged categories, not 50+ fine-grained labels
Many public PII datasets annotate at fine granularity — separate labels for FIRSTNAME, MIDDLENAME, LASTNAME, PREFIX, SEX, etc. For GDPR redaction and pseudonymization, this granularity is noise: downstream code needs to mask "a name", not decide whose first name vs last name it is.
nl-lokaal-klein merges equivalent sub-types at training time so the model predicts one coherent span per entity instead of several adjacent spans:
| Merged output | Training sources (ai4privacy 300k labels) |
|---|---|
| PERSON | FIRSTNAME, MIDDLENAME, LASTNAME1, LASTNAME2, LASTNAME3, GIVENNAME1, GIVENNAME2, PREFIX, TITLE, SEX |
| ADDRESS | STREET, BUILDINGNUMBER, BUILDING, SECONDARYADDRESS, SECADDRESS, GEOCOORD, NEARBYGPSCOORDINATE |
| CITY | CITY, STATE, COUNTY, COUNTRY |
| POSTAL_CODE | POSTCODE, ZIPCODE |
| IP_ADDRESS | IP, IPV4, IPV6, MAC |
| DATE_OF_BIRTH | BOD, DOB |
| BSN | SOCIALNUMBER, SSN |
| IBAN | IBAN, ACCOUNTNUMBER |
| CREDIT_CARD | CREDITCARDNUMBER |
This improves real-world redaction quality (one clean span over Jan van der Berg instead of three) but costs 5–10 F1 points on strict seqeval when evaluated against datasets that annotate at fine granularity — a prediction of a single ADDRESS over Kerkstraat 15 counts as one prediction, while fine-grained gold counts it as two entities (STREET + BUILDINGNUMBER).
We consider this a feature for the redaction use case and a disclosed limitation for benchmarking.
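To illustrate the merge, a sketch of how fine-grained BIO tags could be collapsed into the coarse scheme at training time. The mapping shown is a partial subset of the table above, and this simplified version also fuses adjacent same-type entities (which is exactly the behavior that produces one clean span per entity):

```python
# Partial mapping only; the full table is in the "Design choices" section.
MERGE = {
    "FIRSTNAME": "PERSON", "LASTNAME1": "PERSON", "GIVENNAME1": "PERSON",
    "STREET": "ADDRESS", "BUILDINGNUMBER": "ADDRESS",
    "POSTCODE": "POSTAL_CODE", "ZIPCODE": "POSTAL_CODE",
    "SOCIALNUMBER": "BSN", "SSN": "BSN",
}

def merge_bio(labels):
    """Map fine-grained BIO tags to the merged scheme, re-deriving B-/I-
    so adjacent sub-spans of the same merged type collapse into one entity."""
    out, prev = [], "O"
    for tag in labels:
        if tag == "O":
            out.append("O"); prev = "O"; continue
        merged = MERGE.get(tag[2:], tag[2:])
        # Continue the previous span if it maps to the same merged type.
        out.append(("I-" if prev == merged else "B-") + merged)
        prev = merged
    return out

print(merge_bio(["B-STREET", "I-STREET", "B-BUILDINGNUMBER", "O"]))
# ['B-ADDRESS', 'I-ADDRESS', 'I-ADDRESS', 'O']
```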
2. Full label set: 23 entity types, 47 BIO labels
Although most evaluation tables below discuss 14 categories (the ones that appear in ai4privacy Dutch gold), the model predicts 23 entity types. The additional nine are useful in Dutch enterprise / legal / healthcare contexts even if underrepresented in the ai4privacy benchmark:
AGE, BTW (VAT), KVK (Chamber of Commerce), LICENSE_PLATE, ORGANIZATION, TECHNOLOGY, URL, DATE, and PASSPORT/DRIVER_LICENSE (separately from the merged gold).
A complete list is in config.json.
3. Distilled from a 110M-parameter Dutch-native teacher
nl-lokaal-klein was trained via knowledge distillation from LokaalHub/nl-lokaal-middel (RobBERT-2023, 110M params). The teacher was fine-tuned on the same data first; its soft-label distributions guided the student. The student retains roughly 92% of the teacher's F1 (0.779 vs 0.844) at ~42% of its parameter count.
Evaluation
All numbers below use seqeval strict (IOB2 scheme) — the most conservative token-classification metric that requires exact entity-boundary matches. Raw model predictions unless labeled pipeline.
Primary result — in-distribution (ai4privacy 300k validation, Dutch)
| Model | Params | F1 | Precision | Recall |
|---|---|---|---|---|
| nl-lokaal-klein (this model) | 46M | 0.7790 | 0.7689 | 0.7895 |
| nl-lokaal-klein + filenthropist rule layer | 46M | 0.7764 | 0.7847 | 0.7683 |
| nl-lokaal-middel (teacher) | 110M | 0.8435 | 0.8070 | 0.8834 |
Evaluated on 7,457 Dutch rows of ai4privacy/pii-masking-300k validation — 47,638 gold entities after 14-category merge. Both models were trained on the 300k train split; the validation split is fully held out.
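To make the strict criterion concrete, here is a minimal pure-Python sketch of strict entity matching (the reported numbers use the seqeval package itself): an entity counts as correct only when the type and both boundaries match exactly, so one merged ADDRESS span over fine-grained STREET + BUILDINGNUMBER gold scores zero.

```python
def entities(tags):
    """Extract (type, start, end) entity spans from IOB2 tags."""
    spans, start = [], None
    for i, t in enumerate(tags + ["O"]):
        if start is not None and (t == "O" or t.startswith("B-") or t[2:] != tags[start][2:]):
            spans.append((tags[start][2:], start, i))
            start = None
        if t.startswith("B-"):
            start = i
    return set(spans)

def strict_f1(gold, pred):
    """F1 over exact (type, start, end) matches, as in seqeval strict mode."""
    g, p = entities(gold), entities(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# One merged prediction vs two fine-grained gold entities: zero credit.
print(strict_f1(["B-STREET", "B-BUILDINGNUMBER"], ["B-ADDRESS", "I-ADDRESS"]))  # 0.0
print(strict_f1(["B-PERSON", "I-PERSON"], ["B-PERSON", "I-PERSON"]))            # 1.0
```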
On post-processing:
nl-lokaal-klein is deployed inside filenthropist with an optional rule layer (regex format-validators for BSN/IBAN/POSTAL_CODE, per-type score thresholds). We measured this layer on the same 300k validation split: F1 0.7764 vs 0.7790 raw — a wash. The rule layer was tuned for a narrower high-precision benchmark and trades recall for precision on the broader distribution, so most deployments should run the model directly unless you need the format-validation backstops for specific regulated types.
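As an example of the kind of format validator such a rule layer can use, a sketch of the standard Dutch elfproef (11-test) for BSN numbers; this is the public checksum rule, not the exact production code:

```python
def valid_bsn(s: str) -> bool:
    """Elfproef (11-test): the weighted digit sum must be divisible by 11,
    with weights 9..2 for the first eight digits and -1 for the last."""
    digits = [int(c) for c in s if c.isdigit()]
    if len(digits) != 9:
        return False
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    return sum(d * w for d, w in zip(digits, weights)) % 11 == 0

print(valid_bsn("123456782"))  # True: the example BSN from the quick start passes
print(valid_bsn("123456789"))  # False: fails the checksum
```

A predicted BSN span that fails the elfproef can be dropped (raising precision) at the cost of missing genuinely malformed but sensitive numbers, which is the recall trade-off described above.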
Related work
The closest comparable open Dutch PII model is OpenMed/OpenMed-PII-Dutch-BioClinicalBERT-Base-110M-v1 (110M params, Apache-2.0), trained on ai4privacy/pii-masking-400k with a 54-label fine-grained scheme. It reports F1 0.8401 on its own 400k held-out benchmark. A direct head-to-head is not scientifically meaningful — different test sets, different label taxonomies (54 fine-grained vs our 14 merged), different boundary conventions — but nl-lokaal-middel reaches a comparable F1 (0.8435) on a comparably-sized Dutch PII held-out set, suggesting parity at this model size.
We did not include OpenMed in the primary table above because its head-to-head score under our merged 14-category scheme reflects label-vocabulary translation artifacts more than model capability.
Per-category breakdown (300k validation, raw model, nl-lokaal-klein)
| Category | Support | Precision | Recall | F1 |
|---|---|---|---|---|
| EMAIL | 2,540 | 0.9466 | 0.9500 | 0.9483 |
| IP_ADDRESS | 2,199 | 0.8490 | 0.9227 | 0.8843 |
| DRIVER_LICENSE | 2,429 | 0.8718 | 0.8592 | 0.8654 |
| PHONE | 1,932 | 0.8012 | 0.8908 | 0.8436 |
| PASSWORD | 1,443 | 0.8032 | 0.8572 | 0.8294 |
| USERNAME | 2,571 | 0.8817 | 0.7651 | 0.8192 |
| DATE_OF_BIRTH | 2,165 | 0.8037 | 0.8226 | 0.8131 |
| BSN | 2,439 | 0.7698 | 0.8130 | 0.7908 |
| DATE | 5,242 | 0.7388 | 0.8373 | 0.7849 |
| PASSPORT | 4,540 | 0.7896 | 0.7267 | 0.7568 |
| CITY | 5,141 | 0.7519 | 0.7440 | 0.7479 |
| PERSON | 8,673 | 0.6974 | 0.7467 | 0.7212 |
| ADDRESS | 4,517 | 0.6663 | 0.7228 | 0.6934 |
| POSTAL_CODE | 1,807 | 0.7689 | 0.6298 | 0.6924 |
| micro avg | 47,638 | 0.7689 | 0.7895 | 0.7790 |
| macro avg | 47,638 | 0.7957 | 0.8063 | 0.7994 |
Strongest categories: structured types with clear formal patterns (EMAIL F1 0.95, IP_ADDRESS 0.88, DRIVER_LICENSE 0.87). Weakest: compound-span types where boundary ambiguity matters most (POSTAL_CODE 0.69, ADDRESS 0.69, PERSON 0.72) — these suffer most under strict seqeval when the 14-category merge collapses sub-spans that gold sometimes splits.
How to reproduce the evaluation
```bash
pip install datasets transformers seqeval
python compare_models.py --dataset 300k
```
Label-mapping tables used by the script are documented in the "Design choices" section above.
Training procedure
Data
| Source | Weight | Role |
|---|---|---|
| ai4privacy/pii-masking-300k (Dutch) | 20% | Real PII spans with character-level annotations |
| Teacher pseudo-labels | 28% | Soft supervision from nl-lokaal-middel on unlabeled Dutch |
| Template-generated synthetic | 17% | Faker nl_NL — fills rare types (IBAN, BSN, KVK) |
| LLM-generated structured HTML/JSON | 15% | Realistic form/record layouts |
| LLM-generated clean prose | 10% | Fluent Dutch passages with injected entities |
| Babelscape/wikineural (nl) | 3% | Dutch Wikipedia NER (PER/ORG/LOC) |
| Babelscape/multinerd (nl) | 3% | Dutch MultiNERD fine-grained NER |
| gretelai/synthetic_pii_finance_multilingual (nl) | 2% | Dutch financial PII |
| careons/dutch-healthcare-pii-ner | 2% | Dutch healthcare PII |
Total: ~40K samples, 25% entity-replacement augmentation → ~50K effective training examples.
Distillation setup
- Teacher: nl-lokaal-middel (110M, RobBERT-2023)
- Loss: cross-entropy on hard labels + KL-divergence to teacher soft probabilities
- Temperature: 3.0, alpha 0.5
- B-tag boundary weight: 2.0× (to improve strict boundary precision)
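The loss above can be sketched with the usual temperature-scaled formulation. The B-tag mask below is illustrative only (which label ids are B- tags depends on the model config), and the shapes are toy values:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5, b_weight=2.0):
    """Hard-label CE (with B-tag up-weighting) + temperature-scaled KL to the
    teacher, combined with weight alpha as described above."""
    num_labels = student_logits.size(-1)
    flat_labels = labels.view(-1)
    # Per-token CE on gold labels, up-weighting boundary (B-) tags.
    ce = F.cross_entropy(student_logits.view(-1, num_labels), flat_labels,
                         reduction="none")
    is_b = (flat_labels % 2 == 1).float()  # illustrative B-tag mask, not the real ids
    ce = (ce * (1.0 + (b_weight - 1.0) * is_b)).mean()
    # KL between temperature-softened distributions, scaled by T^2.
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    return alpha * ce + (1 - alpha) * kl

s = torch.randn(2, 5, 47)            # (batch, seq, num BIO labels)
t = torch.randn(2, 5, 47)
y = torch.randint(0, 47, (2, 5))
loss = distill_loss(s, t, y)
print(loss.item())
```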
Hyperparameter search — our own autoresearch loop
Hyperparameters were not hand-tuned. We used an in-house autoresearch agent that iterates on the config, trains, evaluates on a held-out benchmark, and either keeps or reverts each change — all autonomously. Over 100+ experiments explored learning rate, epochs, sequence length, label smoothing, B-tag weight, data mix ratios, augmentation ratio, and loss variants. Experiment exp-006 (below) was the winner.
The pattern is inspired by Andrej Karpathy's minimal-loop approach to ML research — small, readable code, fast iteration, measured decisions.
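The loop can be sketched as a greedy keep/revert search; the benchmark function here is a toy stand-in for the real train-then-evaluate cycle, and the candidate values are illustrative:

```python
import random

def autoresearch(base_config, candidates, evaluate, seed=0):
    """Greedy keep/revert loop: propose one config change at a time and keep
    it only if the held-out score improves."""
    rng = random.Random(seed)
    config, best = dict(base_config), evaluate(base_config)
    history = []
    for _ in range(len(candidates)):
        key, value = rng.choice(candidates)        # propose one change
        trial = {**config, key: value}
        score = evaluate(trial)                    # stands in for train + benchmark
        kept = score > best
        if kept:
            config, best = trial, score            # keep the change
        history.append((key, value, score, kept))  # otherwise revert implicitly
    return config, best, history

# Toy benchmark whose optimum matches exp-006 (illustrative only).
def toy_eval(cfg):
    return -abs(cfg["lr"] - 3.5e-5) * 1e4 - abs(cfg["epochs"] - 3) * 0.1

cfg, best, hist = autoresearch(
    {"lr": 5e-5, "epochs": 2},
    [("lr", 3.5e-5), ("epochs", 3), ("lr", 1e-4), ("epochs", 5)],
    toy_eval,
)
print(cfg, best)
```

The real loop searched a wider space (learning rate, epochs, sequence length, label smoothing, B-tag weight, data-mix ratios, augmentation ratio, loss variants) over 100+ experiments, but the keep-or-revert decision rule is the same.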
Hyperparameters (exp-006 — winning run)
| | |
|---|---|
| Optimizer | AdamW, weight decay 0.02 |
| Learning rate | 3.5e-5, cosine schedule, 10% warmup |
| Epochs | 3 |
| Batch size | 16 × 2 gradient accumulation = 32 effective |
| Max sequence length | 384 |
| Label smoothing | 0.0 |
| FP16 | Enabled |
| Seed | 42 |
| Hardware | Apple Silicon (MPS) |
Intended use
In scope
- GDPR redaction / pseudonymization of Dutch text (legal documents, emails, customer-service transcripts, medical notes).
- On-device or low-latency PII detection where larger models (110M+) are too slow or too big.
- First-stage filtering before a higher-capacity model or human review.
Out of scope
- High-assurance detection where missing a single PII instance has legal consequences. Use nl-lokaal-middel or an ensemble of multiple models with manual review.
- Languages other than Dutch. The model has not been evaluated on Frisian, Flemish (Belgian Dutch only tangentially), or code-mixed text.
- Anonymization (irreversible removal with re-identification prevention). PII detection is necessary but not sufficient — you also need k-anonymity analysis of retained fields.
- Identifying individual Dutch sub-types (distinguishing first vs last names) — by design, these are merged to PERSON.
Limitations & biases
- Boundary conventions follow ai4privacy 300k. Datasets with different boundaries (e.g., splitting STREET + BUILDINGNUMBER into two separate entities) will score lower under strict evaluation even when the model is qualitatively correct.
- Dutch gazetteer coverage is based on CBS (Centraal Bureau voor de Statistiek) and open sources; uncommon or immigrant-origin Dutch names may see below-average recall.
- Form / structured-data bias. Synthetic training data leans toward form-like text; free-prose performance may be slightly lower.
- Date ambiguity. Dutch date formats (17-04-2026, 17 april 2026, 17/04/26) are all supported, but short "1/4" fragments are intentionally rejected by the production pipeline to reduce false positives.
- No PII categories outside the trained 23 — no detection of medical-record identifiers, custom corporate IDs, biometric descriptors, etc.
Ethical and legal considerations
- This model detects PII. It does not remove, redact, or anonymize it — those operations are the operator's responsibility and must meet the standards of GDPR (Reg. (EU) 2016/679), UAVG, and for AI-system use cases the EU AI Act (Reg. (EU) 2024/1689).
- Do not treat model output as evidence of what is or is not PII for legal purposes. Keep a human reviewer in the loop for legally consequential redactions.
- The model processes text in-memory; no data is sent to external services when run locally. If deployed behind an API, log retention and transit encryption are the operator's responsibility.
Attribution & citation
Base model: RobBERTje-1-gb-bort by DTAI-KULeuven, distilled from RobBERT (MIT license).
Training data: ai4privacy/pii-masking-300k (CC-BY-4.0). Secondary data from Babelscape, Gretel, Careons, and synthetic generation.
Teacher: LokaalHub/nl-lokaal-middel.
If you use nl-lokaal-klein in research or production, please cite:
@misc{nl_lokaal_klein_2026,
title = {nl-lokaal-klein: Efficient Dutch PII NER via Knowledge Distillation},
author = {LokaalHub},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/LokaalHub/nl-lokaal-klein}
}
And please cite the base model and training data:
@inproceedings{delobelle2020robbert,
title = {{RobBERT}: a {Dutch} {RoBERTa}-based Language Model},
author = {Delobelle, Pieter and Winters, Thomas and Berendt, Bettina},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2020},
year = {2020}
}
@misc{ai4privacy_2024,
title = {{PII}-Masking-300k: A Multilingual {PII} Detection Dataset},
author = {{ai4privacy}},
year = {2024},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/ai4privacy/pii-masking-300k}
}
Changelog
- v1.0 — 2026-04-19 — Initial release. Student checkpoint from distillation experiment exp-006.
Built in the Netherlands — optimized for Dutch privacy law, trained on Dutch data, shipped under Apache-2.0.