Training
Given a dataset that is properly set up, we can fine-tune a pretrained model for Named Entity Recognition.
Model Sources
nerblackbox works with PyTorch transformer models only. They can be taken either straight from HuggingFace (HF) or from the Local Filesystem (LF). To employ a model from HF, it is sufficient to specify its name (see Basic Training).
Local models need to be stored in a directory ./store/pretrained_models/<my_model> and (at least) include the following files:
- config.json
- pytorch_model.bin
- tokenizer_config.json
- tokenizer.json
- vocab.json (or vocab.txt)
Note that the name of <my_model> must include the architecture type, e.g. bert.
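For illustration, a local model might be used as sketched below. This is an assumption-laden example: the directory name my-bert-model is made up, and it is assumed (by analogy with the HF case) that a local model is referenced by its directory name.
use a local model (sketch)
# assumed layout (hypothetical):
# ./store/pretrained_models/my-bert-model/
#     config.json
#     pytorch_model.bin
#     tokenizer_config.json
#     tokenizer.json
#     vocab.json  (or vocab.txt)
from nerblackbox import Training

# assumption: the local model is referenced by its directory name
training = Training("my_training", model="my-bert-model", dataset="conll2003")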
Basic Training
A model can be trained on a dataset with specific parameters using the Training class.
Define the Training
The training is defined
- either dynamically through arguments when a Training instance is created:
define training dynamically
training = Training("<training_name>", model="<model_name>", dataset="<dataset_name>")
- or statically by a training configuration file ./store/training_configs/<training_name>.ini:
define training statically
training = Training("<training_name>", from_config=True)
Note that the dynamic variant also creates a training configuration, which is subsequently used.
In both cases, specifying the model and the dataset is mandatory and sufficient.
Training Parameters may be specified but are optional. The hyperparameters that are used by default are globally applicable settings that should give close-to-optimal results for any use case.
In particular, adaptive fine-tuning is employed to ensure that this holds irrespective of the size of the dataset.
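To illustrate the note above: since the dynamic definition also creates a training configuration (presumably under ./store/training_configs/<training_name>.ini), that file can later be reused via from_config=True. A minimal sketch, with my_training as a made-up name:
reuse the generated training configuration (sketch)
from nerblackbox import Training

# dynamic definition: presumably writes ./store/training_configs/my_training.ini
training = Training("my_training", model="bert-base-cased", dataset="conll2003")

# later: re-create the same training from the generated configuration file
training = Training("my_training", from_config=True)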
Run the Training
The training is run using the following command:
run training
training.run()
See the Python API documentation for further details.
Main Results
When the training is finished, one can get its main results like so:
Main Results (single training)
training.get_result(metric="f1", level="entity", phase="test")
See the Python API documentation for further details.
An overview of all conducted trainings and their main results can be accessed using the Store class:
Main Results (all trainings)
Store.show_trainings()
Example
An English BERT model can be trained on the CoNLL-2003 dataset like this:
Example: Training
training = Training("my_training", model="bert-base-cased", dataset="conll2003")
training.run()
training.get_result(metric="f1", level="entity", phase="test")
# 0.9045
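Other argument combinations can be queried the same way. The following sketch assumes that precision and recall, as well as the token level, are available metric and level values (suggested by the Detailed Results section, but not confirmed there):
Example: further result queries (sketch)
training.get_result(metric="precision", level="token", phase="test")
training.get_result(metric="recall", level="entity", phase="test")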
Advanced Training
Parameters
nerblackbox uses a large number of default (hyper)parameters that can be customized as needed. The relevant parameters simply need to be specified when the training is defined, either statically or dynamically.
- In the static case, a training configuration file may look like this:
Example: static training configuration file with parameters
# my_training.ini
[dataset]
dataset_name = swedish_ner_corpus
annotation_scheme = plain
train_fraction = 0.1  # for testing
val_fraction = 1.0
test_fraction = 1.0
train_on_val = False
train_on_test = False
[model]
pretrained_model_name = af-ai-center/bert-base-swedish-uncased
[settings]
checkpoints = True
logging_level = info
multiple_runs = 1
seed = 42
[hparams]
max_epochs = 250
early_stopping = True
monitor = val_loss
min_delta = 0.0
patience = 0
mode = min
lr_warmup_epochs = 2
lr_num_cycles = 4
lr_cooldown_restarts = True
lr_cooldown_epochs = 7
[runA]
batch_size = 16
max_seq_length = 128
lr_max = 2e-5
lr_schedule = constant
- In the dynamic case, the equivalent example is:
Example: dynamic training with parameters
training = Training(
    "my_training",
    model="af-ai-center/bert-base-swedish-uncased",  # model = model_name
    dataset="swedish_ner_corpus",                    # dataset = dataset_name
    annotation_scheme="plain",
    train_fraction=0.1,  # for testing
    val_fraction=1.0,
    test_fraction=1.0,
    train_on_val=False,
    train_on_test=False,
    checkpoints=True,
    logging_level="info",
    multiple_runs=1,
    seed=42,
    max_epochs=250,
    early_stopping=True,
    monitor="val_loss",
    min_delta=0.0,
    patience=0,
    mode="min",
    lr_warmup_epochs=2,
    lr_num_cycles=4,
    lr_cooldown_restarts=True,
    lr_cooldown_epochs=7,
    batch_size=16,
    max_seq_length=128,
    lr_max=2e-5,
    lr_schedule="constant",
)
The parameters can be divided into 4 parameter groups:
- Dataset
- Model
- Settings
- Hyperparameters
In the following, we will go through the different parameters step by step to see what they mean.
1. Dataset
Key | Mandatory | Default Value | Type | Values | Comment |
---|---|---|---|---|---|
dataset_name | Yes | --- | str | e.g. conll2003 | key = dataset can be used instead |
annotation_scheme | No | auto | str | auto, plain, bio, bilou | specify annotation scheme (e.g. BIO). auto means it is inferred from data |
train_fraction | No | 1.0 | float | 0.0 - 1.0 | fraction of train dataset to be used |
val_fraction | No | 1.0 | float | 0.0 - 1.0 | fraction of val dataset to be used |
test_fraction | No | 1.0 | float | 0.0 - 1.0 | fraction of test dataset to be used |
train_on_val | No | False | bool | True, False | whether to train additionally on validation dataset |
train_on_test | No | False | bool | True, False | whether to train additionally on test dataset |
Example: static training configuration file with parameters (Dataset)
# my_training.ini
# ..
[dataset]
dataset_name = swedish_ner_corpus
annotation_scheme = plain
train_fraction = 0.1 # for testing
val_fraction = 1.0
test_fraction = 1.0
train_on_val = False
train_on_test = False
2. Model
Key | Mandatory | Default Value | Type | Values | Comment |
---|---|---|---|---|---|
pretrained_model_name | Yes | --- | str | e.g. af-ai-center/bert-base-swedish-uncased | key = model can be used instead |
Example: static training configuration file with parameters (Model)
# my_training.ini
# ..
[model]
pretrained_model_name = af-ai-center/bert-base-swedish-uncased
3. Settings
Key | Mandatory | Default Value | Type | Values | Comment |
---|---|---|---|---|---|
checkpoints | No | True | bool | True, False | whether to save model checkpoints |
logging_level | No | info | str | info, debug | choose logging level, debug is more verbose |
multiple_runs | No | 1 | int | 1+ | choose how often each hyperparameter run is executed (to control for statistical uncertainties) |
seed | No | 42 | int | 1+ | for reproducibility. multiple runs get assigned different seeds. |
Example: static training configuration file with parameters (Settings)
# my_training.ini
# ..
[settings]
checkpoints = True
logging_level = info
multiple_runs = 1
seed = 42
4. Hyperparameters
Key | Mandatory | Default Value | Type | Values | Comment |
---|---|---|---|---|---|
batch_size | No | 16 | int | e.g. 16, 32, 64 | number of training samples in one batch |
max_seq_length | No | 128 | int | e.g. 64, 128, 256 | maximum sequence length used for model's input data |
max_epochs | No | 250 | int | 1+ | (maximum) amount of training epochs |
early_stopping | No | True | bool | True, False | whether to use early stopping |
monitor | No | val_loss | str | val_loss, val_acc | if early stopping is True: metric to monitor (acc = accuracy) |
min_delta | No | 0.0 | float | 0.0+ | if early stopping is True: minimum amount of improvement (w.r.t. monitored metric) required to continue training |
patience | No | 0 | int | 0+ | if early stopping is True: number of epochs to wait for improvement w.r.t. monitored metric until training is stopped |
mode | No | min | str | min, max | if early stopping is True: whether the optimum for the monitored metric is the minimum (val_loss) or maximum (val_acc) value |
lr_warmup_epochs | No | 2 | int | 0+ | number of epochs to linearly increase the learning rate during the warm-up phase, gets translated to num_warmup_steps |
lr_max | No | 2e-5 | float | e.g. 2e-5, 3e-5 | maximum learning rate (after warm-up) for AdamW optimizer |
lr_schedule | No | constant | str | constant, linear, cosine, cosine_with_hard_restarts, hybrid | Learning Rate Schedule, i.e. how to vary the learning rate (after warm-up). hybrid = constant + linear cool-down. |
lr_num_cycles | No | 4 | int | 1+ | num_cycles for lr_schedule = cosine or lr_schedule = cosine_with_hard_restarts |
lr_cooldown_restarts | No | True | bool | True, False | if early stopping is True: whether to restart normal training if monitored metric improves during cool-down phase |
lr_cooldown_epochs | No | 7 | int | 0+ | if early stopping is True or lr_schedule == hybrid: number of epochs to linearly decrease the learning rate during the cool-down phase |
Example: static training configuration file with parameters (Hyperparameters)
# my_training.ini
# ..
[hparams]
max_epochs = 250
early_stopping = True
monitor = val_loss
min_delta = 0.0
patience = 0
mode = min
lr_warmup_epochs = 2
lr_num_cycles = 4
lr_cooldown_restarts = True
lr_cooldown_epochs = 7
[runA]
batch_size = 16
max_seq_length = 128
lr_max = 2e-5
lr_schedule = constant
Presets
In addition to the manual specification of parameters discussed above, the dynamic training definition allows for the use of several hyperparameter presets. They can be specified using the from_preset argument of Training() like so:
define training dynamically using preset
training = Training("<training_name>", model="<model_name>", dataset="<dataset_name>", from_preset="adaptive")
In the following, we list the different presets together with the Hyperparameters that they entail:
- from_preset = adaptive
Adaptive fine-tuning (introduced in this paper) is a method that automatically trains for a near-optimal number of epochs. It is used by default in nerblackbox.
adaptive fine-tuning preset
[hparams]
max_epochs = 250
early_stopping = True
monitor = val_loss
min_delta = 0.0
patience = 0
mode = min
lr_warmup_epochs = 2
lr_schedule = constant
lr_cooldown_epochs = 7
- from_preset = original
Original fine-tuning uses the hyperparameters from the original BERT paper. They are suitable for large datasets.
original fine-tuning preset
[hparams]
max_epochs = 5
early_stopping = False
lr_warmup_epochs = 2
lr_schedule = linear
- from_preset = stable
Stable fine-tuning is a method based on this paper. It is suitable for both small and large datasets.
stable fine-tuning preset
[hparams]
max_epochs = 20
early_stopping = False
lr_warmup_epochs = 2
lr_schedule = linear
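For instance, the presets listed above could be compared on one dataset by defining one training per preset. This is a sketch using only the calls introduced earlier; the training names are made up:
Example: comparing presets (sketch)
from nerblackbox import Training

for preset in ["adaptive", "original", "stable"]:
    training = Training(
        f"my_training_{preset}",  # made-up training name
        model="bert-base-cased",
        dataset="conll2003",
        from_preset=preset,
    )
    training.run()
    print(preset, training.get_result(metric="f1", level="entity", phase="test"))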
Hyperparameter Search
A hyperparameter grid search can easily be conducted as part of a training (currently only using the static definition).
The hyperparameters one wants to vary are to be specified in special sections [runA], [runB] etc. in the training configuration file.
Example: Hyperparameter Search
# my_training.ini
# ..
[runA]
batch_size = 16
max_seq_length = 128
lr_max = 2e-5
lr_schedule = constant
[runB]
batch_size = 32
max_seq_length = 64
lr_max = 3e-5
lr_schedule = cosine
This defines two hyperparameter runs (runA & runB). Each hyperparameter run is executed multiple_runs times (see Parameters).
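The grid search itself is run like any statically defined training, i.e. from the configuration file. A minimal sketch, where my_training refers to the configuration above:
run hyperparameter search (sketch)
from nerblackbox import Training, Store

training = Training("my_training", from_config=True)
training.run()

# overview of the conducted trainings and their main results (see Main Results)
Store.show_trainings()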
Multiple Seeds
The results of a training run depend on the employed random seed (see e.g. this paper for a discussion). One may conduct multiple runs that are identical except for their seeds, in order to
- get control over the uncertainties (see Detailed Results)
- get an improved model performance
Multiple runs can easily be specified in the training configuration.
Example: Settings / Multiple Runs
# my_training.ini
# ..
[settings]
multiple_runs = 3
seed = 42
This creates 3 runs with seeds 43, 44 and 45.
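The dynamic equivalent passes the same settings as keyword arguments. A sketch based on the Settings parameters above:
Example: Settings / Multiple Runs (dynamic, sketch)
from nerblackbox import Training

training = Training(
    "my_training",
    model="bert-base-cased",
    dataset="conll2003",
    multiple_runs=3,
    seed=42,
)
training.run()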
Detailed Results
In addition to the Main Results, one may have a look at much more detailed results of a training run using mlflow or tensorboard.
Detailed Results
# Python
Store.mlflow("start")       # + enter http://127.0.0.1:5000 in your browser
Store.tensorboard("start")  # + enter http://127.0.0.1:6006 in your browser

# CLI
nerblackbox mlflow          # + enter http://127.0.0.1:5000 in your browser
nerblackbox tensorboard     # + enter http://127.0.0.1:6006 in your browser
Python: The underlying processes can be stopped using Store.mlflow("stop") and Store.tensorboard("stop").
- mlflow displays precision, recall and f1 score for every single class, as well as the respective micro- and macro-averages over all classes, both on the token and the entity level. The following excerpt shows
  - the micro- and macro-averages of the recall on the entity level
  - precision, recall and f1 score for the LOC(ation) class on the token level
  In addition, one has access to the log file and the confusion matrices (token and entity level) of the model predictions on the test set. A small excerpt is shown below:
- tensorboard shows the learning curves of important metrics like the loss and the f1 score. A small excerpt is shown below: