# Reproduction of Results
Results reported on HuggingFace are reproduced with nerblackbox as a cross-check. We do this separately for:
- Training (pretrained models are fine-tuned for NER and subsequently evaluated on the validation set)
- Evaluation (NER models are evaluated on either the validation or test set)
Note that all numbers refer to the micro-averaged f1 score as computed by the seqeval library (see Evaluation).
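As a minimal illustration of this metric (with made-up tag sequences, not data from the tables below), seqeval scores predictions on the entity level rather than the token level:

```python
# minimal sketch: micro-averaged entity-level f1 as computed by seqeval
from seqeval.metrics import f1_score

# toy example: 3 gold entities, 2 of them predicted exactly
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"], ["O", "B-ORG", "I-ORG", "O"]]

# precision = 2/2, recall = 2/3  =>  micro f1 = 0.8
print(f1_score(y_true, y_pred, average="micro"))
```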
## Training
Model | Dataset | Parameters | version | nerblackbox | reported | reference
---|---|---|---|---|---|---
bert-base-cased | conll2003 | {"from_preset": "original", "multiple_runs": 3, "max_epochs": 5, "lr_warmup_epochs": 0} | 1.0.0 | 0.946(1) | 0.951 | reference
distilbert-base-multilingual-cased | conll2003 | {"from_preset": "original", "multiple_runs": 3, "max_epochs": 3, "lr_warmup_epochs": 0} | 1.0.0 | 0.943(1) | 0.941 | reference
dmis-lab/biobert-base-cased-v1.2 | ncbi_disease | {"from_preset": "original", "multiple_runs": 3, "max_epochs": 3, "lr_warmup_epochs": 0, "batch_size": 4} | 1.0.0 | 0.855(4) | 0.845 | reference
distilroberta-base | conll2003 | {"from_preset": "original", "multiple_runs": 3, "max_epochs": 6, "lr_warmup_epochs": 0, "batch_size": 32, "lr_max": 5e-5} | 1.0.0 | 0.953(1) | 0.953 | reference
microsoft/deberta-base | conll2003 | {"from_preset": "original", "multiple_runs": 3, "max_epochs": 5, "lr_warmup_epochs": 0, "batch_size": 16, "lr_max": 5e-5} | 1.0.0 | 0.957(1) | 0.961 | reference
Each model was fine-tuned multiple times using nerblackbox and different random seeds. The resulting uncertainty of the mean value is specified in parentheses and refers to the last digit. In all cases, the evaluation was conducted on the respective validation dataset. Note that small, systematic differences are to be expected as the reported results were created using very similar yet slightly different hyperparameters.
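To make the parenthesized uncertainty concrete, here is a minimal sketch, assuming the number in parentheses denotes the standard error of the mean over the individual runs (the scores below are made up for illustration):

```python
# minimal sketch: mean and uncertainty over multiple runs, e.g. 0.946(1)
from statistics import mean, stdev

scores = [0.9447, 0.9462, 0.9471]          # hypothetical f1 scores from 3 runs
m = mean(scores)
sem = stdev(scores) / len(scores) ** 0.5   # standard error of the mean
print(f"{m:.3f} +/- {sem:.3f}")            # 0.946 +/- 0.001, written as 0.946(1)
```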
The results may be reproduced using the following code:

```python
# reproduction of training results
from nerblackbox import Dataset, Training

# download and set up the dataset from HuggingFace
dataset = Dataset(name="<dataset>", source="HF")
dataset.set_up()

# hyperparameters as listed in the table above
parameters = {"from_preset": "original", [..]}

# fine-tune the model and read off the entity-level f1 score on the validation set
training = Training("training", model="<model>", dataset="<dataset>", **parameters)
training.run()
result = training.get_result(metric="f1", level="entity", phase="validation")
print(result)
```
## Evaluation
Model | Dataset | Phase | version | nerblackbox | evaluate | reported | reference
---|---|---|---|---|---|---|---
dslim/bert-base-NER | conll2003 | validation | 1.0.0 | 0.951 | 0.951 | 0.951 | reference
jordyvl/biobert-base-cased-v1.2_ncbi_disease-softmax-labelall-ner | ncbi_disease | validation | 1.0.0 | 0.845 | 0.845 | 0.845 | reference
fhswf/bert_de_ner | germeval_14 | test | 1.0.0 | 0.818 | 0.818 | 0.829 | reference
philschmid/distilroberta-base-ner-conll2003 | conll2003 | validation | 1.0.0 | 0.955 | 0.955 | 0.953 | reference
philschmid/distilroberta-base-ner-conll2003 | conll2003 | test | 1.0.0 | 0.913 | 0.913 | 0.907 | reference
projecte-aina/roberta-base-ca-cased-ner | projecte-aina/ancora-ca-ner | test | 1.0.0 | 0.896 | 0.896 | 0.881 | reference
gagan3012/bert-tiny-finetuned-ner | conll2003 | validation | 1.0.0 | 0.847 | --- | 0.818 | reference
malduwais/distilbert-base-uncased-finetuned-ner | conll2003 | test | 1.0.0 | 0.894 | --- | 0.930 | reference
IIC/bert-base-spanish-wwm-cased-ehealth_kd | ehealth_kd | test | 1.0.0 | 0.825 | --- | 0.843 | reference
drAbreu/bioBERT-NER-BC2GM_corpus | bc2gm_corpus | test | 1.0.0 | 0.808 | --- | 0.815 | reference
geckos/deberta-base-fine-tuned-ner | conll2003 | validation | 1.0.0 | 0.962 | --- | 0.961 | reference
The results may be reproduced using the following code:

```python
# reproduction of evaluation results

###############
# nerblackbox
###############
from nerblackbox import Model

# load the fine-tuned model from HuggingFace and evaluate it on the given dataset split
model = Model.from_huggingface("<model>", "<dataset>")
results_nerblackbox = model.evaluate_on_dataset("<dataset>", phase="<phase>")
print(results_nerblackbox["micro"]["entity"]["f1_seqeval"])

###############
# evaluate
###############
from datasets import load_dataset
from evaluate import evaluator

# instantiate a token-classification evaluator and score the model with seqeval
task_evaluator = evaluator("token-classification")
data = load_dataset("<dataset>", split="<phase>")
results_evaluate = task_evaluator.compute(
    model_or_pipeline="<model>",
    data=data,
    metric="seqeval",
)
print(results_evaluate["overall_f1"])
```
Evaluation using the evaluate library fails for some of the tested datasets (marked --- in the table above). In all cases where it works, the results from nerblackbox and evaluate are in agreement. In contrast, the self-reported results on HuggingFace sometimes differ.
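The agreement between the two libraries can be checked directly. A minimal sketch, assuming the snippet above has been run so that results_nerblackbox and results_evaluate are available:

```python
# cross-check: the two entity-level micro f1 scores should match up to rounding
f1_nerblackbox = results_nerblackbox["micro"]["entity"]["f1_seqeval"]
f1_evaluate = results_evaluate["overall_f1"]
assert abs(f1_nerblackbox - f1_evaluate) < 1e-3, (f1_nerblackbox, f1_evaluate)
```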