# BERT-base-cased fine-tuned on OntoNotes 5.0
This model is a fine-tuned version of google-bert/bert-base-cased on the English subset of the OntoNotes 5.0 (CoNLL-2012) dataset. It is designed for Named Entity Recognition (NER) and can identify 18 types of entities.
## Performance
The model achieves the following results on the OntoNotes 5.0 test set:
| Entity | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| CARDINAL | 0.7776 | 0.8070 | 0.7920 | 1005 |
| DATE | 0.7943 | 0.8628 | 0.8272 | 1786 |
| EVENT | 0.5000 | 0.6235 | 0.5550 | 85 |
| FAC | 0.6081 | 0.6040 | 0.6061 | 149 |
| GPE | 0.9243 | 0.9156 | 0.9199 | 2546 |
| LANGUAGE | 0.7500 | 0.6818 | 0.7143 | 22 |
| LAW | 0.5200 | 0.5909 | 0.5532 | 44 |
| LOC | 0.6478 | 0.7442 | 0.6926 | 215 |
| MONEY | 0.8760 | 0.9155 | 0.8953 | 355 |
| NORP | 0.8956 | 0.9182 | 0.9067 | 990 |
| ORDINAL | 0.7252 | 0.7778 | 0.7506 | 207 |
| ORG | 0.8621 | 0.8991 | 0.8802 | 2002 |
| PERCENT | 0.8575 | 0.9017 | 0.8790 | 407 |
| PERSON | 0.9080 | 0.9161 | 0.9121 | 2134 |
| PRODUCT | 0.5918 | 0.6444 | 0.6170 | 90 |
| QUANTITY | 0.7042 | 0.6536 | 0.6780 | 153 |
| TIME | 0.5906 | 0.6667 | 0.6263 | 225 |
| WORK_OF_ART | 0.6022 | 0.6450 | 0.6229 | 169 |
| micro avg | 0.8413 | 0.8710 | 0.8559 | 12584 |
| macro avg | 0.7297 | 0.7649 | 0.7460 | 12584 |
| weighted avg | 0.8440 | 0.8710 | 0.8570 | 12584 |
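These per-entity scores follow the layout of a `seqeval` classification report. As a minimal sketch (assuming evaluation with `seqeval`; `y_true` and `y_pred` are illustrative placeholders for the gold and predicted BIO tag sequences of the test set):

```python
# Hypothetical sketch: computing per-entity precision/recall/F1 with seqeval.
# y_true / y_pred are toy stand-ins for the real test-set tag sequences.
from seqeval.metrics import classification_report

y_true = [["B-ORG", "O", "O", "B-PERSON", "I-PERSON", "O", "B-GPE", "O"]]
y_pred = [["B-ORG", "O", "O", "B-PERSON", "I-PERSON", "O", "B-GPE", "O"]]

# digits=4 matches the four-decimal precision used in the table above.
print(classification_report(y_true, y_pred, digits=4))
```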
## Training Details
- Architecture: `BertForTokenClassification`
- Tokenizer: `BertTokenizerFast` (using `is_split_into_words=True` for label alignment)
- Epochs: 5
- Learning Rate: 2e-5
- Batch Size: 16 per device (Total 32 on 2x V100 GPUs)
- Max Sequence Length: 128
- Weight Decay: 0.01
- Mixed Precision (FP16): Enabled
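As a hedged sketch, these settings map onto Hugging Face `TrainingArguments` roughly as follows. The `output_dir` is a placeholder, dataset preparation and the `Trainer` call are omitted, and the 128-token limit is applied at tokenization time rather than here:

```python
# Illustrative reconstruction of the hyperparameters listed above.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-base-ontonotes5-ner",  # hypothetical output path
    num_train_epochs=5,
    learning_rate=2e-5,
    per_device_train_batch_size=16,  # 2x V100 -> effective batch size of 32
    weight_decay=0.01,
    fp16=True,  # mixed precision
)
```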
## Label Mapping
The model was trained on the following 18 OntoNotes entity types, encoded with a BIO tagging scheme (37 labels in total, including `O`):
CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART.
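For illustration, the full label set can be reconstructed from these types as sketched below; the authoritative index-to-label order is the `id2label` mapping shipped in the model's `config.json`:

```python
# Sketch: expanding 18 entity types into a 37-label BIO scheme (O + B-/I- per type).
entity_types = [
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC",
    "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT",
    "QUANTITY", "TIME", "WORK_OF_ART",
]
labels = ["O"] + [f"{p}-{t}" for t in entity_types for p in ("B", "I")]
id2label = dict(enumerate(labels))  # e.g. {0: "O", 1: "B-CARDINAL", ...}
label2id = {label: i for i, label in id2label.items()}
assert len(labels) == 37
```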
## Project Assets
- GitHub Repository: https://github.com/Learnrr/ontonotes5_ner_evaluation.git
| Asset | File | Description |
|---|---|---|
| Model Weights | `model.safetensors` | Main checkpoint in Safetensors format (safe, fast loading, ~431 MB). |
| Configuration | `config.json` | Model architecture settings and the `id2label` entity mapping. |
| Vocabulary | `vocab.txt` | BERT-cased WordPiece vocabulary for tokenization. |
| Tokenizer | `tokenizer.json` / `tokenizer_config.json` | Optimized fast tokenizer configuration and serialization. |
| Special Tokens | `special_tokens_map.json` | Definitions for special tokens like `[CLS]`, `[SEP]`, etc. |
| Training Args | `training_args.bin` | Detailed hyperparameter settings used during the training run. |
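For instance, the `id2label` mapping stored in `config.json` can be inspected directly from the Hub; a minimal sketch:

```python
# Sketch: recovering the entity mapping shipped with the checkpoint.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("learnrr/bert-base-ontonotes5-ner")
print(config.id2label)  # maps label indices to BIO tags, e.g. {0: "O", ...}
```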
## Usage
You can use this model directly with a pipeline for token classification:
```python
from transformers import pipeline

model_checkpoint = "learnrr/bert-base-ontonotes5-ner"
token_classifier = pipeline(
    "token-classification",
    model=model_checkpoint,
    aggregation_strategy="simple",
)

text = "Apple was founded by Steve Jobs in Cupertino."
results = token_classifier(text)

for entity in results:
    print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Score: {entity['score']:.4f}")
```