Text Encoding

Text may contain whitespace characters (e.g. "\n", "\t") or special characters (e.g. "•", emojis) that a pre-trained model has never seen before. While whitespace characters are ignored during tokenization, special characters lead to out-of-vocabulary tokens, which are replaced by [UNK] before being passed to the model. Sometimes, however, the ignored or replaced characters carry semantic information that is valuable to the model and should therefore be preserved.
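
To see why this matters, it helps to look at how a standard pre-trained tokenizer treats such characters. The sketch below uses the Hugging Face transformers library and the bert-base-cased checkpoint purely as an illustration; it is not part of nerblackbox, and the exact output depends on the tokenizer's vocabulary.

tokenizer behavior (illustration)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# whitespace characters such as "\n" are typically dropped during pre-tokenization,
# and characters that are missing from the vocabulary end up as "[UNK]"
print(tokenizer.tokenize("We\n are in • Stockholm"))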

Therefore, nerblackbox allows selected special characters to be mapped to self-defined special tokens ("encoding"). The encoded text can then be used during training and inference.

Say we want to have the following replacements:

encoding
# map special characters to special tokens
encoding = {
    '\n': '[NEWLINE]',
    '\t': '[TAB]',
    '•': '[DOT]',
}

The first step is to save the encoding in an encoding.json file, placed in the folder ./store/datasets/<dataset_name> that also contains the data (see Data).

create encoding.json
import json

with open('./store/datasets/<dataset_name>/encoding.json', 'w') as file:
    json.dump(encoding, file)

This way, the special tokens are automatically added to the model's vocabulary during training.
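
Under the hood, extending a pre-trained model's vocabulary roughly amounts to registering the new tokens with the tokenizer and resizing the embedding matrix. The sketch below illustrates this general mechanism with the Hugging Face transformers API and a bert-base-cased checkpoint; it is not nerblackbox's internal code.

vocabulary extension (illustration)
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased")

# register the new special tokens with the tokenizer ...
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[NEWLINE]", "[TAB]", "[DOT]"]}
)

# ... and enlarge the model's embedding matrix accordingly
model.resize_token_embeddings(len(tokenizer))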


The second step is to apply the encoding to the data. The TextEncoder class takes care of this:

TextEncoder
from nerblackbox import TextEncoder

text_encoder = TextEncoder(encoding)

For training, one needs to encode the input text like so:

text encoding (training)
# ..load input_text 

# ENCODE
# e.g. input_text             = 'We\n are in • Stockholm' 
#      input_text_encoded     = 'We[NEWLINE] are in [DOT] Stockholm'
input_text_encoded, _ = text_encoder.encode(input_text)  

# ..save input_text_encoded and use it for training

For inference, the text is encoded in the same way, and the predictions then need to be mapped back to the original text, like so:

text encoding (inference)
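# ..load input_text and model
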
# ENCODE
# e.g. input_text             = 'We\n are in • Stockholm'
#      input_text_encoded     = 'We[NEWLINE] are in [DOT] Stockholm'
#      encode_decode_mappings = [(2, "\n", "[NEWLINE]"), (13, "•", "[DOT]")]
input_text_encoded, encode_decode_mappings = text_encoder.encode(input_text)


# PREDICT
# e.g. predictions_encoded    = {'char_start': 25, 'char_end': 34, 'token': 'Stockholm', 'tag': 'LOC'}
predictions_encoded = model.predict(input_text_encoded, level="entity")


# DECODE
# e.g. input_text_decoded     = 'We\n are in • Stockholm' 
#      predictions            = {'char_start': 13, 'char_end': 22, 'token': 'Stockholm', 'tag': 'LOC'}
input_text_decoded, predictions = text_encoder.decode(input_text_encoded,
                                                      encode_decode_mappings,
                                                      predictions_encoded)
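
Conceptually, decode reverses each substitution and shifts the predicted character offsets by the length difference between the special token and the original character. The standalone sketch below illustrates this idea with a hypothetical helper and simplified data structures; it is not the nerblackbox implementation.

decoding idea (illustration)
def decode_sketch(text_encoded, substitutions, entities):
    """Hypothetical helper: undo each substitution and shift entity offsets.

    substitutions: list of (special_token, original_char) pairs
    entities:      list of dicts with 'char_start' / 'char_end' keys
    """
    text = text_encoded
    entities = [dict(e) for e in entities]  # copy, keep the input untouched
    for special_token, original_char in substitutions:
        idx = text.find(special_token)
        while idx != -1:
            shift = len(special_token) - len(original_char)
            text = text[:idx] + original_char + text[idx + len(special_token):]
            # entities that start behind the replaced span move to the left
            for entity in entities:
                if entity["char_start"] > idx:
                    entity["char_start"] -= shift
                    entity["char_end"] -= shift
            idx = text.find(special_token, idx + len(original_char))
    return text, entities

text, entities = decode_sketch(
    "We[NEWLINE] are in [DOT] Stockholm",
    [("[NEWLINE]", "\n"), ("[DOT]", "•")],
    [{"char_start": 25, "char_end": 34, "token": "Stockholm", "tag": "LOC"}],
)
print(repr(text))  # 'We\n are in • Stockholm'
print(entities)    # [{'char_start': 13, 'char_end': 22, 'token': 'Stockholm', 'tag': 'LOC'}]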