Evaluation
The Model class provides the functionality to evaluate any NER model on any NER dataset. Both the fine-tuned NER model and the dataset can be loaded either from HuggingFace (HF) or from the local filesystem (LF).
Usage
1) load a Model instance:
load model
from nerblackbox import Model

# from local checkpoint directory
model = Model.from_checkpoint("<checkpoint_directory>")
# from training
model = Model.from_training("<training_name>")
# from HuggingFace
model = Model.from_huggingface("<repo_id>")
2) use the evaluate_on_dataset() method:
model evaluation on dataset
# local dataset in standard format (jsonl)
results = model.evaluate_on_dataset("<local_dataset_in_standard_format>", "jsonl", phase="test")
# local dataset in pretokenized format (csv)
results = model.evaluate_on_dataset("<local_dataset_in_pretokenized_format>", "csv", phase="test")
# huggingface dataset in pretokenized format
results = model.evaluate_on_dataset("<huggingface_dataset_in_pretokenized_format>", "huggingface", phase="test")
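The returned results object is a plain nested dictionary of floats (see Interpretation below), so it can easily be persisted for later comparison across models. A minimal sketch, where the file name is arbitrary:
save results (sketch)
import json

# store the nested results dictionary on disk (hypothetical file name)
with open("evaluation_results.json", "w") as f:
    json.dump(results, f, indent=4)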
Interpretation
The returned object results is a nested dictionary results[label][level][metric], where label is one of ['micro', 'macro'], level is one of ['entity', 'token'], and metric is one of ['precision', 'recall', 'f1', 'precision_seqeval', 'recall_seqeval', 'f1_seqeval'].
results
results["micro"]["entity"]
# {
# 'precision': 0.912,
# 'recall': 0.919,
# 'f1': 0.916,
# 'precision_seqeval': 0.907,
# 'recall_seqeval': 0.919,
# 'f1_seqeval': 0.913
# }
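Since the structure is identical for every combination of label and level described above, all metrics can be printed with a simple loop over the nested dictionary. A minimal sketch, assuming all four label/level combinations are populated as described:
print all metrics (sketch)
# iterate over the nested results dictionary and print every metric
for label in ["micro", "macro"]:
    for level in ["entity", "token"]:
        for metric, value in results[label][level].items():
            print(f"{label:5} | {level:6} | {metric:17} = {value:.3f}")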
The metrics precision, recall and f1 are nerblackbox's own evaluation results, whereas their counterparts with a _seqeval suffix correspond to the results you would get using the seqeval library (which is also used by HuggingFace evaluate).
The difference lies in how model predictions that are inconsistent with the employed Annotation Scheme are handled: nerblackbox's evaluation ignores them, whereas seqeval takes them into account.
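To make this concrete, the following sketch uses seqeval directly on a made-up toy example in which the predicted entity starts with I-PER instead of B-PER and is therefore inconsistent with the BIO annotation scheme; seqeval (in its default, lenient mode) still scores such a prediction, whereas nerblackbox's own metrics ignore it:
scheme-inconsistent prediction (sketch)
from seqeval.metrics import classification_report

# made-up toy tags: the prediction starts an entity with "I-PER" instead of "B-PER",
# which violates the BIO annotation scheme
true_tags = [["O", "B-PER", "I-PER", "O"]]
pred_tags = [["O", "I-PER", "I-PER", "O"]]

# seqeval's default (lenient) mode still counts the scheme-inconsistent prediction as an entity
print(classification_report(true_tags, pred_tags))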
Example
A complete example of an evaluation using both the model and the dataset from HuggingFace:
complete evaluation example
# 1. load the model
model = Model.from_huggingface("dslim/bert-base-NER")
# 2. evaluate the model on the dataset
results = model.evaluate_on_dataset("conll2003", "huggingface", phase="test")
# 3. inspect the results
results["micro"]["entity"]
# {
# 'precision': 0.912,
# 'recall': 0.919,
# 'f1': 0.916,
# 'precision_seqeval': 0.907,
# 'recall_seqeval': 0.919,
# 'f1_seqeval': 0.913
# }
Note that the seqeval results are in accordance with the officially reported results. The nerblackbox results show a slightly higher precision (and thus a slightly higher f1 score).