Evaluation
The Model class provides the functionality to evaluate any NER model on any NER dataset. Both the fine-tuned NER model and the dataset can be loaded either from HuggingFace (HF) or from the local filesystem (LF).
Usage
1) load a Model instance:
load model
from nerblackbox import Model

# from local checkpoint directory
model = Model.from_checkpoint("<checkpoint_directory>")
# from training
model = Model.from_training("<training_name>")
# from HuggingFace
model = Model.from_huggingface("<repo_id>")
2) use the evaluate_on_dataset() method:
model evaluation on dataset
# local dataset in standard format (jsonl)
results = model.evaluate_on_dataset("<local_dataset_in_standard_format>", "jsonl", phase="test")
# local dataset in pretokenized format (csv)
results = model.evaluate_on_dataset("<local_dataset_in_pretokenized_format>", "csv", phase="test")
# huggingface dataset in pretokenized format
results = model.evaluate_on_dataset("<huggingface_dataset_in_pretokenized_format>", "huggingface", phase="test")
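The returned results object is a plain nested dictionary of floats (see Interpretation below), so it can easily be persisted for later comparison across models. A minimal sketch, where the file name is arbitrary:
save results (sketch)
import json

# store the nested results dictionary on disk (hypothetical file name)
with open("evaluation_results.json", "w") as f:
    json.dump(results, f, indent=4)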
Interpretation
The returned object results is a nested dictionary results[label][level][metric], where label is one of ['micro', 'macro'], level is one of ['entity', 'token'], and metric is one of ['precision', 'recall', 'f1', 'precision_seqeval', 'recall_seqeval', 'f1_seqeval'].
results
results["micro"]["entity"]
# {
# 'precision': 0.912,
# 'recall': 0.919,
# 'f1': 0.916,
# 'precision_seqeval': 0.907,
# 'recall_seqeval': 0.919,
# 'f1_seqeval': 0.913
# }
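Since the structure is identical for every combination of label and level described above, all metrics can be printed with a simple loop over the nested dictionary. A minimal sketch, assuming all four label/level combinations are populated as described:
print all metrics (sketch)
# iterate over the nested results dictionary and print every metric
for label in ["micro", "macro"]:
    for level in ["entity", "token"]:
        for metric, value in results[label][level].items():
            print(f"{label:5} | {level:6} | {metric:17} = {value:.3f}")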
The metrics precision, recall and f1 are nerblackbox's own evaluation results, whereas their counterparts with a _seqeval suffix correspond to the results you would get using the seqeval library (which is also used by HuggingFace evaluate).
The difference lies in how model predictions that are inconsistent with the employed Annotation Scheme are handled: nerblackbox's evaluation ignores them, whereas seqeval takes them into account.
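To make this concrete, the following sketch uses seqeval directly on a made-up toy example in which the predicted entity starts with I-PER instead of B-PER and is therefore inconsistent with the BIO annotation scheme; seqeval (in its default, lenient mode) still scores such a prediction, whereas nerblackbox's own metrics ignore it:
scheme-inconsistent prediction (sketch)
from seqeval.metrics import classification_report

# made-up toy tags: the prediction starts an entity with "I-PER" instead of "B-PER",
# which violates the BIO annotation scheme
true_tags = [["O", "B-PER", "I-PER", "O"]]
pred_tags = [["O", "I-PER", "I-PER", "O"]]

# seqeval's default (lenient) mode still counts the scheme-inconsistent prediction as an entity
print(classification_report(true_tags, pred_tags))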
Example
A complete example of an evaluation using both the model and the dataset from HuggingFace:
complete evaluation example
# 1. load the model
model = Model.from_huggingface("dslim/bert-base-NER")
# 2. evaluate the model on the dataset
results = model.evaluate_on_dataset("conll2003", "huggingface", phase="test")
# 3. inspect the results
results["micro"]["entity"]
# {
# 'precision': 0.912,
# 'recall': 0.919,
# 'f1': 0.916,
# 'precision_seqeval': 0.907,
# 'recall_seqeval': 0.919,
# 'f1_seqeval': 0.913
# }
Note that the seqeval results are in accordance with the officially reported results. The nerblackbox results show a slightly higher precision (and thus a slightly higher f1 score).