# BERT-base-cased fine-tuned on OntoNotes 5.0
This model is a fine-tuned version of google-bert/bert-base-cased on the English subset of the OntoNotes 5.0 (CoNLL-2012) dataset. It is designed for Named Entity Recognition (NER) and can identify 18 types of entities.
## Performance
The model achieves the following results on the OntoNotes 5.0 test set:
| Entity | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| CARDINAL | 0.7776 | 0.8070 | 0.7920 | 1005 |
| DATE | 0.7943 | 0.8628 | 0.8272 | 1786 |
| EVENT | 0.5000 | 0.6235 | 0.5550 | 85 |
| FAC | 0.6081 | 0.6040 | 0.6061 | 149 |
| GPE | 0.9243 | 0.9156 | 0.9199 | 2546 |
| LANGUAGE | 0.7500 | 0.6818 | 0.7143 | 22 |
| LAW | 0.5200 | 0.5909 | 0.5532 | 44 |
| LOC | 0.6478 | 0.7442 | 0.6926 | 215 |
| MONEY | 0.8760 | 0.9155 | 0.8953 | 355 |
| NORP | 0.8956 | 0.9182 | 0.9067 | 990 |
| ORDINAL | 0.7252 | 0.7778 | 0.7506 | 207 |
| ORG | 0.8621 | 0.8991 | 0.8802 | 2002 |
| PERCENT | 0.8575 | 0.9017 | 0.8790 | 407 |
| PERSON | 0.9080 | 0.9161 | 0.9121 | 2134 |
| PRODUCT | 0.5918 | 0.6444 | 0.6170 | 90 |
| QUANTITY | 0.7042 | 0.6536 | 0.6780 | 153 |
| TIME | 0.5906 | 0.6667 | 0.6263 | 225 |
| WORK_OF_ART | 0.6022 | 0.6450 | 0.6229 | 169 |
| micro avg | 0.8413 | 0.8710 | 0.8559 | 12584 |
| macro avg | 0.7297 | 0.7649 | 0.7460 | 12584 |
| weighted avg | 0.8440 | 0.8710 | 0.8570 | 12584 |
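These per-entity scores follow the layout of a `seqeval` classification report. As a minimal sketch (assuming evaluation with `seqeval`; `y_true` and `y_pred` are illustrative placeholders for the gold and predicted BIO tag sequences of the test set):

```python
# Hypothetical sketch: computing per-entity precision/recall/F1 with seqeval.
# y_true / y_pred are toy stand-ins for the real test-set tag sequences.
from seqeval.metrics import classification_report

y_true = [["B-ORG", "O", "O", "B-PERSON", "I-PERSON", "O", "B-GPE", "O"]]
y_pred = [["B-ORG", "O", "O", "B-PERSON", "I-PERSON", "O", "B-GPE", "O"]]

# digits=4 matches the four-decimal precision used in the table above.
print(classification_report(y_true, y_pred, digits=4))
```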
## Training Details
- Architecture: `BertForTokenClassification`
- Tokenizer: `BertTokenizerFast` (using `is_split_into_words=True` for label alignment)
- Epochs: 5
- Learning Rate: 2e-5
- Batch Size: 16 per device (Total 32 on 2x V100 GPUs)
- Max Sequence Length: 128
- Weight Decay: 0.01
- Mixed Precision (FP16): Enabled
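As a hedged sketch, these settings map onto Hugging Face `TrainingArguments` roughly as follows. The `output_dir` is a placeholder, dataset preparation and the `Trainer` call are omitted, and the 128-token limit is applied at tokenization time rather than here:

```python
# Illustrative reconstruction of the hyperparameters listed above.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-base-ontonotes5-ner",  # hypothetical output path
    num_train_epochs=5,
    learning_rate=2e-5,
    per_device_train_batch_size=16,  # 2x V100 -> effective batch size of 32
    weight_decay=0.01,
    fp16=True,  # mixed precision
)
```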
## Label Mapping
The model was trained on the following 18 OntoNotes entity types, encoded with a BIO tagging scheme (37 labels in total, including `O`):
CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART.
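For illustration, the full label set can be reconstructed from these types as sketched below; the authoritative index-to-label order is the `id2label` mapping shipped in the model's `config.json`:

```python
# Sketch: expanding 18 entity types into a 37-label BIO scheme (O + B-/I- per type).
entity_types = [
    "CARDINAL", "DATE", "EVENT", "FAC", "GPE", "LANGUAGE", "LAW", "LOC",
    "MONEY", "NORP", "ORDINAL", "ORG", "PERCENT", "PERSON", "PRODUCT",
    "QUANTITY", "TIME", "WORK_OF_ART",
]
labels = ["O"] + [f"{p}-{t}" for t in entity_types for p in ("B", "I")]
id2label = dict(enumerate(labels))  # e.g. {0: "O", 1: "B-CARDINAL", ...}
label2id = {label: i for i, label in id2label.items()}
assert len(labels) == 37
```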
## Project Assets
- GitHub Repository: https://github.com/Learnrr/ontonotes5_ner_evaluation.git
| Asset | File | Description |
|---|---|---|
| Model Weights | `model.safetensors` | Main checkpoint in Safetensors format (safe, fast loading, ~431 MB). |
| Configuration | `config.json` | Model architecture settings and the `id2label` entity mapping. |
| Vocabulary | `vocab.txt` | BERT-cased WordPiece vocabulary for tokenization. |
| Tokenizer | `tokenizer.json` / `tokenizer_config.json` | Optimized fast tokenizer configuration and serialization. |
| Special Tokens | `special_tokens_map.json` | Definitions for special tokens like `[CLS]`, `[SEP]`, etc. |
| Training Args | `training_args.bin` | Detailed hyperparameter settings used during the training run. |
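For instance, the `id2label` mapping stored in `config.json` can be inspected directly from the Hub; a minimal sketch:

```python
# Sketch: recovering the entity mapping shipped with the checkpoint.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("learnrr/bert-base-ontonotes5-ner")
print(config.id2label)  # maps label indices to BIO tags, e.g. {0: "O", ...}
```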
## Usage
You can use this model directly with a pipeline for token classification:
```python
from transformers import pipeline

model_checkpoint = "learnrr/bert-base-ontonotes5-ner"
token_classifier = pipeline(
    "token-classification",
    model=model_checkpoint,
    aggregation_strategy="simple",
)

text = "Apple was founded by Steve Jobs in Cupertino."
results = token_classifier(text)

for entity in results:
    print(f"Entity: {entity['word']} | Label: {entity['entity_group']} | Score: {entity['score']:.4f}")
```