Training


How-To

To train a tokenizer on data in the <data_train> folder, run script_train.py:

train tokenizer
python script_train.py 
  --tokenizer_name <tokenizer_name>      # e.g. tokenizer1
  --dataset_files <dataset_files>        # e.g. "all" = all files in <data_train>
  [--dataset_filter all]                 # e.g. "all" = no filter
  [--vocab_size 64000]                   # int, divisible by 128
  [--monolingual]                        # if used, monolingual models are trained
  [--library SP]                         # SP = SentencePiece, HF = HuggingFace
  [--unicode_normalization None]         # None, NFC, NFKC
  [--individual_digits 1]                # 0, 1
  [--add_prefix_space 1]                 # 0, 1
  [--add_whitespace_tokens 2]            # 0, 1 (added at top of vocab), 2 (added at bottom of vocab)
  [--add_code_tokens 1]                  # 0, 1 (added at top of vocab)
  [--add_newline_token 0]                # 0, 1 (added at top of vocab)
  [--minimum_frequency 0]                # int >= 0
  [--initial_alphabet 0]                 # 0, 1 (only HF)
  [--byte_fallback 1]                    # 0, 1 (only SP)
  [--character_coverage 0.9999]          # float, useful if byte_fallback = 1 (only SP)
  [--train_extremely_large_corpus 1]     # 0, 1 (only SP)

Arguments:

  • --tokenizer_name will be the name of the tokenizer, e.g. tokenizer1
  • --dataset_files specifies the data files in <data_train> you would like to include, e.g.
    • <dataset_files> = all (which uses all files) or
    • <dataset_files> = data_en.jsonl data_sv.jsonl (which uses two specific files)
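
For example, the following illustrative call trains a single multilingual SentencePiece tokenizer on the two files named above (all values are examples):

python script_train.py --tokenizer_name tokenizer1 --dataset_files data_en.jsonl data_sv.jsonl --library SP --vocab_size 64000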


Optional Arguments:

  • --dataset_filter is an alternative to --dataset_files (which must then be set to all); any file that contains the specified substring is included
  • --vocab_size specifies the desired vocabulary size
  • --monolingual trains (multiple) monolingual tokenizers instead of a single multilingual one
  • --library specifies the library to use (SP = SentencePiece or HF = HuggingFace)
  • --unicode_normalization specifies the Unicode normalization (None, NFC, or NFKC) applied to the data during preprocessing
  • --individual_digits splits numbers so that each digit becomes a separate token
  • --add_prefix_space adds a dummy whitespace to the beginning of the text, i.e. before its first token; both this and --individual_digits are illustrated in the sketch after this list
  • --add_whitespace_tokens adds 23 consecutive whitespace tokens to the tokenizer's vocabulary
  • --add_code_tokens adds special code tokens to the tokenizer's vocabulary. These are specified in CODE_TOKENS.csv
  • --add_newline_token adds a special "\n" token to the tokenizer's vocabulary
  • --minimum_frequency specifies the minimum frequency required for a token to be added to the vocabulary. Note that this most likely results in a vocabulary smaller than the vocabulary size specified via --vocab_size
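
For the HuggingFace path, --add_prefix_space and --individual_digits surface as pre-tokenizer components (compare the pre_tokenizer block in the tokenizer.json excerpt under Results). The following is a minimal sketch, not the actual code of script_train.py:

from tokenizers import pre_tokenizers

# Sketch only: ByteLevel handles the dummy prefix space, Digits splits numbers.
pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.ByteLevel(add_prefix_space=True),  # --add_prefix_space 1
    pre_tokenizers.Digits(individual_digits=True),    # --individual_digits 1
])

# Each digit of "2023" becomes its own pre-token.
print(pre_tokenizer.pre_tokenize_str("Price 2023 SEK"))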


Optional Arguments only available for HuggingFace:

  • --initial_alphabet adds 256 single characters to the tokenizer's vocabulary

We refer to the HuggingFace documentation for further details.
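
A minimal sketch of what this presumably corresponds to in the HuggingFace tokenizers library, where the byte-level alphabet consists of exactly 256 single characters (the exact trainer settings used by script_train.py are an assumption here):

from tokenizers import pre_tokenizers, trainers

# 256 single-character tokens seeded into the vocabulary before training.
alphabet = pre_tokenizers.ByteLevel.alphabet()
trainer = trainers.BpeTrainer(vocab_size=64000, initial_alphabet=alphabet)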


Optional Arguments only available for SentencePiece:

  • --byte_fallback
  • --character_coverage
  • --train_extremely_large_corpus

The corresponding SentencePiece features are used. We refer to the SentencePiece documentation for further details.
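
As a sketch, these flags map onto the options of the same name in sentencepiece.SentencePieceTrainer; the input path below is illustrative, and the actual call in script_train.py may set additional parameters:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",                 # illustrative; the script prepares its own input
    model_prefix="model",
    vocab_size=64000,
    byte_fallback=True,                 # --byte_fallback 1
    character_coverage=0.9999,          # --character_coverage 0.9999
    train_extremely_large_corpus=True,  # --train_extremely_large_corpus 1
)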


Results

The trained tokenizer is saved in the folder <output>/YYmmdd_HHMMSS-v<vocab_size>_<tokenizer_name> and contains the following files:

  • Tokenizer: model.model & model.vocab (if library == "SP") or tokenizer.json (if library == "HF")

    model.vocab (SentencePiece)
    <pad> 0
    <unk> 0
    <s> 0
    <|endoftext|> 0
    <|javascript|> 0
    <|python|> 0
    <|sql|> 0
    <|shell|> 0
    <0x00> 0
    <0x01> 0
    
    [..]
    
    <0xFE> 0
    <0xFF> 0
    ▁t -0
    he -1
    ▁a -2
    
    [..]
    
    ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ -63713
    
    tokenizer.json (HuggingFace)
    {
        [..]
        "truncation": null,
        "padding": null,
        "added_tokens": [
            {
                "id": 0,
                "content": "<|javascript|>",
                [..]
                "special": true
            },
            {
                "id": 1,
                "content": "<|python|>",
                [..]
                "special": true
            },
            {
                "id": 2,
                "content": "<|sql|>",
                [..]
                "special": true
            },
            {
                "id": 3,
                "content": "<|shell|>",
                [..]
                "special": true
            }
        ],
        "normalizer": null,
        "pre_tokenizer": {
            "type": "Sequence",
            "pretokenizers": [
                {
                    "type": "ByteLevel",
                    "add_prefix_space": true,
                    "trim_offsets": true,
                    "use_regex": true
                },
                {
                    "type": "Digits",
                    "individual_digits": true
                }
            ]
        },
        "post_processor": {
            "type": "ByteLevel",
            "add_prefix_space": true,
            "trim_offsets": false,
            "use_regex": true
        },
        "decoder": {
            "type": "ByteLevel",
            "add_prefix_space": true,
            "trim_offsets": true,
            "use_regex": true
        },
        "model": {
        "type": "BPE",
        "dropout": null,
        "unk_token": null,
        "continuing_subword_prefix": null,
        "end_of_word_suffix": null,
        "fuse_unk": false,
        "vocab": {
            "<|javascript|>": 0,
            "<|python|>": 1,
            "<|sql|>": 2,
            "<|shell|>": 3,
            "!": 4,
            "\"": 5,
    
            [..]
    
            "e on",
            "e res",
            "e ase"
        }
    }
    
  • Training parameters: parameters.txt

    parameters.txt
    library = SP
    dataset_files = [..]
    tokenizer_name = tokenizer1
    unicode_normalization = None
    individual_digits = True
    add_prefix_space = True
    add_whitespace_tokens = 2
    add_code_tokens = 1
    minimum_frequency = 0
    byte_fallback = True
    character_coverage = 0.9999
    train_extremely_large_corpus = True
    vocab_size = 63977
    vocab_size_external = 64000
    special_tokens = ['<|javascript|>', '<|python|>', '<|sql|>', '<|shell|>']
    timestamp = 230912_110630
    output_dir = [..]
    
  • Tokenizer Vocabulary Statistics: tokenizer_subword_lengths.json

    tokenizer_subword_lengths.json
    {
        "1": 173,
        "2": 1700, 
        "3": 4965, 
        "4": 8477, 
        [..]
        "24": 1, 
        "mean": 6.656015625, 
        "vocab_size": 64000,
    }
    
  • Dataset Statistics: overview.json

    overview.json
    {
        "files": 1, 
        "documents_total": 8625, 
        "documents": 8625, 
        "dataset_files": [..], 
        "data_size_total": "0.0121G", 
        "data_size": ["0.0032G", "0.0022G", "0.0014G", "0.0004G", "0.0000G", "0.0038G", "0.0011G"], 
        "time": "7.95s",
    }
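
To sanity-check a trained tokenizer, it can be loaded directly with the corresponding library. The snippet below is a sketch; the output path is illustrative and only follows the naming scheme described above:

import sentencepiece as spm
from tokenizers import Tokenizer

out_dir = "output/230912_110630-v64000_tokenizer1"  # illustrative path

# library == "SP": model.model & model.vocab
sp = spm.SentencePieceProcessor(model_file=f"{out_dir}/model.model")
print(sp.encode("Hello world", out_type=str))

# library == "HF": tokenizer.json
tok = Tokenizer.from_file(f"{out_dir}/tokenizer.json")
print(tok.encode("Hello world").tokens)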