Model
Model that predicts tags for a given input text.
`__init__(checkpoint_directory, batch_size=16, max_seq_length=None, dataset=None)`

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `checkpoint_directory` | `str` | path to the checkpoint directory | required |
| `batch_size` | `int` | batch size used by the dataloader | `16` |
| `max_seq_length` | `Optional[int]` | maximum sequence length; loaded from the checkpoint if not specified | `None` |
| `dataset` | `Optional[str]` | should be provided in case the model config is missing `id2label` information | `None` |
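The `dataset` argument exists because some checkpoints lack an `id2label` mapping in their config, which is needed to translate class indices into tag names. A minimal sketch of what such a mapping looks like (the concrete tag set here is illustrative, not taken from any specific model):

```python
# Illustrative id2label mapping as stored in a model config;
# the actual tag set depends on the dataset the model was trained on.
id2label = {0: "O", 1: "B-ORG", 2: "I-ORG", 3: "B-LOC", 4: "I-LOC"}

# The inverse mapping, label2id, is typically stored alongside it.
label2id = {tag: idx for idx, tag in id2label.items()}
print(label2id["B-LOC"])  # index of the B-LOC tag
```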
`evaluate_on_dataset(dataset_name, dataset_format='infer', phase='test', class_mapping=None, number=None, derived_from_jsonl=False, rounded_decimals=3)`

Evaluate the model on a dataset from Hugging Face, or on a local dataset in jsonl or csv format.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_name` | `str` | e.g. `'conll2003'` | required |
| `dataset_format` | `str` | `'huggingface'`, `'jsonl'` or `'csv'` | `'infer'` |
| `phase` | `str` | e.g. `'test'` | `'test'` |
| `class_mapping` | `Optional[Dict[str, str]]` | e.g. `{"PER": "PI", "ORG": "PI"}` | `None` |
| `number` | `Optional[int]` | e.g. `100` | `None` |
| `derived_from_jsonl` | `bool` | | `False` |
| `rounded_decimals` | `Optional[int]` | if not `None`, results are rounded to the given number of decimals | `3` |
Returns:

| Name | Type | Description |
|---|---|---|
| `evaluation_dict` | `EVALUATION_DICT` | dict with keys `[label][level][metric]`, where label in `['micro', 'macro']`, level in `['entity', 'token']`, metric in `['precision', 'recall', 'f1', 'precision_seqeval', 'recall_seqeval', 'f1_seqeval']`, and values are floats between 0 and 1 |
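The nested key structure can be navigated with plain dict access. A sketch that builds a skeleton with the documented keys and reads off a single metric (the value assigned here is invented for illustration):

```python
# Hypothetical EVALUATION_DICT skeleton with the documented key structure.
METRICS = ["precision", "recall", "f1",
           "precision_seqeval", "recall_seqeval", "f1_seqeval"]
evaluation_dict = {
    label: {level: {metric: 0.0 for metric in METRICS}
            for level in ["entity", "token"]}
    for label in ["micro", "macro"]
}
evaluation_dict["micro"]["entity"]["f1"] = 0.897  # invented value

# Read a single metric off the nested dict.
print(evaluation_dict["micro"]["entity"]["f1"])  # 0.897
```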
`from_checkpoint(checkpoint_directory)`

classmethod

Load a model from a checkpoint directory.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `checkpoint_directory` | `str` | path to the checkpoint directory | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `model` | `Optional[Model]` | best model from training |
`from_huggingface(repo_id, dataset=None)`

classmethod

Load a model from the Hugging Face Hub.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `repo_id` | `str` | Hugging Face Hub repo id, e.g. `'KB/bert-base-swedish-cased-ner'` | required |
| `dataset` | `Optional[str]` | should be provided in case the model config is missing `id2label` information | `None` |

Returns:

| Name | Type | Description |
|---|---|---|
| `model` | `Optional[Model]` | model |
`from_training(training_name)`

classmethod

Load the best model from a training run.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `training_name` | `str` | name of the training, e.g. `"training0"` | required |

Returns:

| Name | Type | Description |
|---|---|---|
| `model` | `Optional[Model]` | best model from training |
`predict(input_texts, level='entity', autocorrect=False, is_pretokenized=False)`

Predict tags for input texts. Output is on the entity or word level.
Examples:

```python
predict(["arbetsförmedlingen finns i stockholm"], level="word", autocorrect=False)
# [[
#     {"char_start": "0", "char_end": "18", "token": "arbetsförmedlingen", "tag": "I-ORG"},
#     {"char_start": "19", "char_end": "24", "token": "finns", "tag": "O"},
#     {"char_start": "25", "char_end": "26", "token": "i", "tag": "O"},
#     {"char_start": "27", "char_end": "36", "token": "stockholm", "tag": "B-LOC"},
# ]]

predict(["arbetsförmedlingen finns i stockholm"], level="word", autocorrect=True)
# [[
#     {"char_start": "0", "char_end": "18", "token": "arbetsförmedlingen", "tag": "B-ORG"},
#     {"char_start": "19", "char_end": "24", "token": "finns", "tag": "O"},
#     {"char_start": "25", "char_end": "26", "token": "i", "tag": "O"},
#     {"char_start": "27", "char_end": "36", "token": "stockholm", "tag": "B-LOC"},
# ]]

predict(["arbetsförmedlingen finns i stockholm"], level="entity", autocorrect=False)
# [[
#     {"char_start": "27", "char_end": "36", "token": "stockholm", "tag": "LOC"},
# ]]

predict(["arbetsförmedlingen finns i stockholm"], level="entity", autocorrect=True)
# [[
#     {"char_start": "0", "char_end": "18", "token": "arbetsförmedlingen", "tag": "ORG"},
#     {"char_start": "27", "char_end": "36", "token": "stockholm", "tag": "LOC"},
# ]]
```
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_texts` | `Union[str, List[str]]` | e.g. `["example 1", "example 2"]` | required |
| `level` | `str` | `"entity"` or `"word"` | `'entity'` |
| `autocorrect` | `bool` | if `True`, the annotation scheme (e.g. `B-` and `I-` tags) is autocorrected | `False` |
| `is_pretokenized` | `bool` | `True` if `input_texts` are pretokenized | `False` |
Returns:

| Name | Type | Description |
|---|---|---|
| `predictions` | `PREDICTIONS` | list of predictions for the different examples; each contains a list of dicts with keys `char_start`, `char_end`, `token`, `tag` |
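Given this format, downstream code can filter or group entities with plain dict access. A sketch using the entity-level output from the examples above:

```python
# Entity-level predictions, copied from the documented example output.
predictions = [[
    {"char_start": "0", "char_end": "18", "token": "arbetsförmedlingen", "tag": "ORG"},
    {"char_start": "27", "char_end": "36", "token": "stockholm", "tag": "LOC"},
]]

# Collect all tokens tagged as locations in the first example.
locations = [p["token"] for p in predictions[0] if p["tag"] == "LOC"]
print(locations)  # ['stockholm']
```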
`predict_on_file(input_file, output_file)`

Predict tags for all input texts in the input file and write the results to the output file.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_file` | `str` | e.g. `strangnas/test.jsonl` | required |
| `output_file` | `str` | e.g. `strangnas/test_anonymized.jsonl` | required |
`predict_proba(input_texts, is_pretokenized=False)`

Predict probability distributions for input texts. Output is on the word level.

Examples:

```python
predict_proba(["arbetsförmedlingen finns i stockholm"])
# [[
#     {"char_start": "0", "char_end": "18", "token": "arbetsförmedlingen", "proba_dist": {"O": 0.21, "B-ORG": 0.56, ..}},
#     {"char_start": "19", "char_end": "24", "token": "finns", "proba_dist": {"O": 0.87, "B-ORG": 0.02, ..}},
#     {"char_start": "25", "char_end": "26", "token": "i", "proba_dist": {"O": 0.95, "B-ORG": 0.01, ..}},
#     {"char_start": "27", "char_end": "36", "token": "stockholm", "proba_dist": {"O": 0.14, "B-ORG": 0.22, ..}},
# ]]
```
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `input_texts` | `Union[str, List[str]]` | e.g. `["example 1", "example 2"]` | required |
| `is_pretokenized` | `bool` | `True` if `input_texts` are pretokenized | `False` |
Returns:

| Name | Type | Description |
|---|---|---|
| `predictions` | `PREDICTIONS` | list of probability predictions for the different examples; each contains a list of dicts with keys `char_start`, `char_end`, `token`, `proba_dist`, where `proba_dist` is a dict mapping `self.annotation.classes` to probabilities |
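The probability output can be reduced to hard tags by taking the argmax over each `proba_dist`. A sketch with distributions truncated to two classes (the values are invented for illustration):

```python
# Word-level probability predictions, truncated to two classes for illustration.
predictions = [[
    {"char_start": "0", "char_end": "18", "token": "arbetsförmedlingen",
     "proba_dist": {"O": 0.21, "B-ORG": 0.56}},
    {"char_start": "19", "char_end": "24", "token": "finns",
     "proba_dist": {"O": 0.87, "B-ORG": 0.02}},
]]

# Pick the most probable tag for each token.
hard_tags = [max(p["proba_dist"], key=p["proba_dist"].get) for p in predictions[0]]
print(hard_tags)  # ['B-ORG', 'O']
```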