# Model

A model that predicts tags for a given input text.

## `__init__(checkpoint_directory, batch_size=16, max_seq_length=None, dataset=None)`

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `checkpoint_directory` | `str` | path to the checkpoint directory | *required* |
| `batch_size` | `int` | batch size used by the dataloader | `16` |
| `max_seq_length` | `Optional[int]` | maximum sequence length; loaded from the checkpoint if not specified | `None` |
| `dataset` | `Optional[str]` | should be provided in case the model is missing `id2label` information in its config | `None` |

## `evaluate_on_dataset(dataset_name, dataset_format='infer', phase='test', class_mapping=None, number=None, derived_from_jsonl=False, rounded_decimals=3)`

Evaluate the model on a dataset from HuggingFace, or on a local dataset in jsonl or csv format.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `dataset_name` | `str` | e.g. `'conll2003'` | *required* |
| `dataset_format` | `str` | `'huggingface'`, `'jsonl'` or `'csv'` | `'infer'` |
| `phase` | `str` | e.g. `'test'` | `'test'` |
| `class_mapping` | `Optional[Dict[str, str]]` | e.g. `{"PER": "PI", "ORG": "PI"}` | `None` |
| `number` | `Optional[int]` | e.g. `100` | `None` |
| `derived_from_jsonl` | `bool` | | `False` |
| `rounded_decimals` | `Optional[int]` | if not `None`, results are rounded to the provided number of decimals | `3` |
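A `class_mapping` collapses several gold labels into one before metrics are computed. The sketch below only mirrors this remapping idea with plain dict manipulation; the helper `apply_class_mapping` is illustrative and not part of the library, which applies the mapping internally:

```python
# Sketch: how a class_mapping like {"PER": "PI", "ORG": "PI"} collapses
# several labels into one. Illustrative only; not the library's internals.

def apply_class_mapping(tags, class_mapping):
    """Map each tag through class_mapping, keeping the BIO prefix intact."""
    mapped = []
    for tag in tags:
        if tag == "O":
            mapped.append(tag)
            continue
        prefix, _, label = tag.partition("-")  # e.g. "B-PER" -> ("B", "PER")
        mapped.append(f"{prefix}-{class_mapping.get(label, label)}")
    return mapped

tags = ["B-PER", "I-PER", "O", "B-ORG", "B-LOC"]
print(apply_class_mapping(tags, {"PER": "PI", "ORG": "PI"}))
# ['B-PI', 'I-PI', 'O', 'B-PI', 'B-LOC']
```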

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `evaluation_dict` | `EVALUATION_DICT` | dict with keys `[label][level][metric]`, where `label` is in `['micro', 'macro']`, `level` is in `['entity', 'token']` and `metric` is in `['precision', 'recall', 'f1', 'precision_seqeval', 'recall_seqeval', 'f1_seqeval']`; values are floats between 0 and 1 |
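The nested `[label][level][metric]` structure can be navigated like any plain dict; a sketch with made-up numbers (not real model results):

```python
# Sketch of the EVALUATION_DICT structure, filled with made-up numbers.
evaluation_dict = {
    label: {
        level: {
            "precision": 0.9, "recall": 0.8, "f1": 0.847,
            "precision_seqeval": 0.9, "recall_seqeval": 0.8, "f1_seqeval": 0.847,
        }
        for level in ("entity", "token")
    }
    for label in ("micro", "macro")
}

# Access pattern: [label][level][metric]
f1 = evaluation_dict["micro"]["entity"]["f1"]
assert 0.0 <= f1 <= 1.0
print(f1)  # 0.847
```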

## `from_checkpoint(checkpoint_directory)` *(classmethod)*

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `checkpoint_directory` | `str` | path to the checkpoint directory | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `model` | `Optional[Model]` | best model from training |
## `from_huggingface(repo_id, dataset=None)` *(classmethod)*

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `repo_id` | `str` | HuggingFace Hub repo id, e.g. `'KB/bert-base-swedish-cased-ner'` | *required* |
| `dataset` | `Optional[str]` | should be provided in case the model is missing `id2label` information in its config | `None` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `model` | `Optional[Model]` | model |

## `from_training(training_name)` *(classmethod)*

Load the best model from a training run.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `training_name` | `str` | name of the training, e.g. `"training0"` | *required* |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `model` | `Optional[Model]` | best model from training |

## `predict(input_texts, level='entity', autocorrect=False, is_pretokenized=False)`

Predict tags for input texts. Output is on entity or word level.

Examples:

```python
predict(["arbetsförmedlingen finns i stockholm"], level="word", autocorrect=False)
# [[
#     {"char_start": "0", "char_end": "18", "token": "arbetsförmedlingen", "tag": "I-ORG"},
#     {"char_start": "19", "char_end": "24", "token": "finns", "tag": "O"},
#     {"char_start": "25", "char_end": "26", "token": "i", "tag": "O"},
#     {"char_start": "27", "char_end": "36", "token": "stockholm", "tag": "B-LOC"},
# ]]
predict(["arbetsförmedlingen finns i stockholm"], level="word", autocorrect=True)
# [[
#     {"char_start": "0", "char_end": "18", "token": "arbetsförmedlingen", "tag": "B-ORG"},
#     {"char_start": "19", "char_end": "24", "token": "finns", "tag": "O"},
#     {"char_start": "25", "char_end": "26", "token": "i", "tag": "O"},
#     {"char_start": "27", "char_end": "36", "token": "stockholm", "tag": "B-LOC"},
# ]]
predict(["arbetsförmedlingen finns i stockholm"], level="entity", autocorrect=False)
# [[
#     {"char_start": "27", "char_end": "36", "token": "stockholm", "tag": "LOC"},
# ]]
predict(["arbetsförmedlingen finns i stockholm"], level="entity", autocorrect=True)
# [[
#     {"char_start": "0", "char_end": "18", "token": "arbetsförmedlingen", "tag": "ORG"},
#     {"char_start": "27", "char_end": "36", "token": "stockholm", "tag": "LOC"},
# ]]
```

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `input_texts` | `Union[str, List[str]]` | e.g. `["example 1", "example 2"]` | *required* |
| `level` | `str` | `"entity"` or `"word"` | `'entity'` |
| `autocorrect` | `bool` | if `True`, autocorrect the annotation scheme (e.g. `B-` and `I-` tags) | `False` |
| `is_pretokenized` | `bool` | `True` if `input_texts` are pretokenized | `False` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `predictions` | `PREDICTIONS` | list of predictions for the different examples; each list contains a list of dicts with keys `char_start`, `char_end`, `token`, `tag` |
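Entity-level predictions carry character offsets, so they can be post-processed directly. A sketch that replaces each predicted entity with its tag, using the documented output format (the predictions list is hardcoded here rather than produced by a real model):

```python
# Sketch: pseudonymize a text from entity-level predictions.
# The predictions below are hardcoded in the documented format; a real
# call would be model.predict([text], level="entity", autocorrect=True).
text = "arbetsförmedlingen finns i stockholm"
predictions = [
    {"char_start": "0", "char_end": "18", "token": "arbetsförmedlingen", "tag": "ORG"},
    {"char_start": "27", "char_end": "36", "token": "stockholm", "tag": "LOC"},
]

# Replace from the end of the text so earlier offsets stay valid.
for p in sorted(predictions, key=lambda p: int(p["char_start"]), reverse=True):
    start, end = int(p["char_start"]), int(p["char_end"])
    text = text[:start] + f"[{p['tag']}]" + text[end:]

print(text)  # [ORG] finns i [LOC]
```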

## `predict_on_file(input_file, output_file)`

Predict tags for all input texts in the input file and write the results to the output file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `input_file` | `str` | e.g. `strangnas/test.jsonl` | *required* |
| `output_file` | `str` | e.g. `strangnas/test_anonymized.jsonl` | *required* |
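The jsonl convention is one JSON object per line. A minimal sketch of writing and inspecting such a file; note the `"text"` field name here is an illustrative assumption, not the library's documented schema:

```python
import json
import os
import tempfile

# Sketch: write and read a jsonl file (one JSON object per line).
# The "text" field name is an assumption for illustration only.
rows = [{"text": "example 1"}, {"text": "example 2"}]

path = os.path.join(tempfile.mkdtemp(), "test.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

with open(path, encoding="utf-8") as f:
    read_back = [json.loads(line) for line in f]

assert read_back == rows
```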

## `predict_proba(input_texts, is_pretokenized=False)`

Predict probability distributions for input texts. Output is on word level.

Examples:

```python
predict_proba(["arbetsförmedlingen finns i stockholm"])
# [[
#     {"char_start": "0", "char_end": "18", "token": "arbetsförmedlingen", "proba_dist": {"O": 0.21, "B-ORG": 0.56, ..}},
#     {"char_start": "19", "char_end": "24", "token": "finns", "proba_dist": {"O": 0.87, "B-ORG": 0.02, ..}},
#     {"char_start": "25", "char_end": "26", "token": "i", "proba_dist": {"O": 0.95, "B-ORG": 0.01, ..}},
#     {"char_start": "27", "char_end": "36", "token": "stockholm", "proba_dist": {"O": 0.14, "B-ORG": 0.22, ..}},
# ]]
```

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `input_texts` | `Union[str, List[str]]` | e.g. `["example 1", "example 2"]` | *required* |
| `is_pretokenized` | `bool` | `True` if `input_texts` are pretokenized | `False` |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| `predictions` | `PREDICTIONS` | list of probability predictions for the different examples; each list contains a list of dicts with keys `char_start`, `char_end`, `token`, `proba_dist`, where `proba_dist` is a dict that maps `self.annotation.classes` to probabilities |
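A per-token probability distribution can be reduced to a hard tag by taking the argmax over `proba_dist`. A sketch on a single hardcoded prediction entry (the probabilities are illustrative, not real model output):

```python
# Sketch: recover the most likely tag from one proba_dist entry.
# Probabilities are illustrative, not real model output.
prediction = {
    "char_start": "0",
    "char_end": "18",
    "token": "arbetsförmedlingen",
    "proba_dist": {"O": 0.21, "B-ORG": 0.56, "I-ORG": 0.23},
}

# Argmax over the distribution: the key with the highest probability.
best_tag = max(prediction["proba_dist"], key=prediction["proba_dist"].get)
print(best_tag)  # B-ORG
```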