## Training

### How-To
To train a tokenizer on data in the `<data_train>` folder, run `script_train.py`:
train tokenizer
```
python script_train.py
    --tokenizer_name <tokenizer_name>     # e.g. tokenizer1
    --dataset_files <dataset_files>       # e.g. "all" = all files in <data_train>
    [--dataset_filter all]                # e.g. "all" = no filter
    [--vocab_size 64000]                  # int, divisible by 128
    [--monolingual]                       # if used, monolingual models are trained
    [--library SP]                        # SP = SentencePiece, HF = HuggingFace
    [--unicode_normalization None]        # None, NFC, NFKC
    [--individual_digits 1]               # 0, 1
    [--add_prefix_space 1]                # 0, 1
    [--add_whitespace_tokens 2]           # 0, 1 (added at top of vocab), 2 (added at bottom of vocab)
    [--add_code_tokens 1]                 # 0, 1 (added at top of vocab)
    [--add_newline_token 0]               # 0, 1 (added at top of vocab)
    [--minimum_frequency 0]               # int >= 0
    [--initial_alphabet 0]                # 0, 1 (only HF)
    [--byte_fallback 1]                   # 0, 1 (only SP)
    [--character_coverage 0.9999]         # float, useful if byte_fallback = 1 (only SP)
    [--train_extremely_large_corpus 1]    # 0, 1 (only SP)
```
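For example, to train a single multilingual SentencePiece tokenizer with a 64k vocabulary on two specific files (the file names are placeholders for files in `<data_train>`):

```
python script_train.py \
    --tokenizer_name tokenizer1 \
    --dataset_files data_en.jsonl data_sv.jsonl \
    --library SP \
    --vocab_size 64000 \
    --byte_fallback 1 \
    --add_whitespace_tokens 2 \
    --add_code_tokens 1
```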
Arguments:
- `--tokenizer_name` will be the name of the tokenizer, e.g. `tokenizer1`
- `--dataset_files` specifies the data files in `<data_train>` you would like to include, e.g. `<dataset_files> = all` (which uses all files) or `<dataset_files> = data_en.jsonl data_sv.jsonl` (which uses two specific files)
Optional Arguments:
- `--dataset_filter` is an alternative to `--dataset_files` (which then needs to be set to `all`). Any files whose names contain the specified substring will be included.
- `--vocab_size` specifies the desired vocabulary size
- `--monolingual` trains (multiple) monolingual tokenizers instead of a single multilingual one
- `--library` specifies the library to use (`SP` = SentencePiece or `HF` = HuggingFace)
- `--unicode_normalization` specifies the unicode normalization that the data is preprocessed with
- `--individual_digits` splits digits into separate tokens (illustrated below)
- `--add_prefix_space` adds a dummy whitespace to the beginning of the first token of a text (illustrated below)
- `--add_whitespace_tokens` adds 23 consecutive-whitespace tokens to the tokenizer's vocabulary
- `--add_code_tokens` adds special code tokens to the tokenizer's vocabulary; these are specified in `CODE_TOKENS.csv`
- `--add_newline_token` adds a special `"\n"` token to the tokenizer's vocabulary
- `--minimum_frequency` specifies the minimum frequency required for a token to be added to the vocabulary. Note that this most likely results in a vocabulary size smaller than the one specified via `--vocab_size`.
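To illustrate `--individual_digits` and `--add_prefix_space`, here is a minimal sketch using the HuggingFace `tokenizers` pre-tokenizers that implement these options in the HF case (the example strings are arbitrary; in the SP case the corresponding SentencePiece options are used):

```python
from tokenizers.pre_tokenizers import ByteLevel, Digits

# --individual_digits 1: every digit becomes its own pre-token
print(Digits(individual_digits=True).pre_tokenize_str("year 2023"))
# pre-tokens: 'year ', '2', '0', '2', '3'

# --add_prefix_space 1: a dummy whitespace is prepended, so the first
# word is treated like a word in the middle of a text ('Ġ' = space)
print(ByteLevel(add_prefix_space=True).pre_tokenize_str("Hello world"))
# pre-tokens: 'ĠHello', 'Ġworld'
```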
Optional Arguments only available for HuggingFace:
- `--initial_alphabet` adds 256 single characters to the tokenizer's vocabulary
We refer to the HuggingFace documentation for further details.
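These 256 characters presumably correspond to the byte-level alphabet (one printable character per possible byte value), which can be inspected directly; this is an assumption about the script's behavior, common for byte-level BPE:

```python
from tokenizers.pre_tokenizers import ByteLevel

# The byte-level alphabet used by ByteLevel BPE tokenizers
alphabet = ByteLevel.alphabet()
print(len(alphabet))  # 256
```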
Optional Arguments only available for SentencePiece:
- `--byte_fallback`
- `--character_coverage`
- `--train_extremely_large_corpus`
The corresponding SentencePiece features are used. We refer to the SentencePiece documentation for further details.
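As a minimal sketch of how these three flags map onto the SentencePiece trainer (the input file and remaining settings are placeholders, not the script's actual configuration):

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.txt",                  # placeholder input file
    model_prefix="model",               # writes model.model & model.vocab
    vocab_size=64000,
    byte_fallback=True,                 # --byte_fallback 1
    character_coverage=0.9999,          # --character_coverage 0.9999
    train_extremely_large_corpus=True,  # --train_extremely_large_corpus 1
)
```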
### Results
The trained tokenizer is saved in the folder `<output>/YYmmdd_HHMMSS-v<vocab_size>_<tokenizer_name>` and contains the following files:
- Tokenizer: `model.model` & `model.vocab` (if `library == "SP"`) or `tokenizer.json` (if `library == "HF"`)

model.vocab (SentencePiece)
```
<pad> 0
<unk> 0
<s> 0
<|endoftext|> 0
<|javascript|> 0
<|python|> 0
<|sql|> 0
<|shell|> 0
<0x00> 0
<0x01> 0
[..]
<0xFE> 0
<0xFF> 0
▁t -0
he -1
▁a -2
[..]
▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ -63713
```

tokenizer.json (HuggingFace)
```json
{
  [..]
  "truncation": null,
  "padding": null,
  "added_tokens": [
    { "id": 0, "content": "<|javascript|>", [..] "special": true },
    { "id": 1, "content": "<|python|>", [..] "special": true },
    { "id": 2, "content": "<|sql|>", [..] "special": true },
    { "id": 3, "content": "<|shell|>", [..] "special": true }
  ],
  "normalizer": null,
  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      { "type": "ByteLevel", "add_prefix_space": true, "trim_offsets": true, "use_regex": true },
      { "type": "Digits", "individual_digits": true }
    ]
  },
  "post_processor": { "type": "ByteLevel", "add_prefix_space": true, "trim_offsets": false, "use_regex": true },
  "decoder": { "type": "ByteLevel", "add_prefix_space": true, "trim_offsets": true, "use_regex": true },
  "model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": null,
    "continuing_subword_prefix": null,
    "end_of_word_suffix": null,
    "fuse_unk": false,
    "vocab": { "<|javascript|>": 0, "<|python|>": 1, "<|sql|>": 2, "<|shell|>": 3, "!": 4, "\"": 5, [..] },
    "merges": [ [..] "e on", "e res", "e ase" ]
  }
}
```
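A quick way to sanity-check the result is to load the trained files and encode a sample string (file names as produced above; the sample text is arbitrary):

```python
import sentencepiece as spm
from tokenizers import Tokenizer

# library == "SP"
sp = spm.SentencePieceProcessor(model_file="model.model")
print(sp.encode("Hello world 2023", out_type=str))

# library == "HF"
tok = Tokenizer.from_file("tokenizer.json")
print(tok.encode("Hello world 2023").tokens)
```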
- Training parameters: `parameters.txt`

parameters.txt
```
library = SP
dataset_files = [..]
tokenizer_name = tokenizer1
unicode_normalization = None
individual_digits = True
add_prefix_space = True
add_whitespace_tokens = 2
add_code_tokens = 1
minimum_frequency = 0
byte_fallback = True
character_coverage = 0.9999
train_extremely_large_corpus = True
vocab_size = 63977
vocab_size_external = 64000
special_tokens = ['<|javascript|>', '<|python|>', '<|sql|>', '<|shell|>']
timestamp = 230912_110630
output_dir = [..]
```
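In this example, the difference between the internal `vocab_size = 63977` and `vocab_size_external = 64000` is 23 tokens, which presumably corresponds to the 23 whitespace tokens appended at the bottom of the vocabulary via `add_whitespace_tokens = 2`.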
- Tokenizer Vocabulary Statistics: `tokenizer_subword_lengths.json`

tokenizer_subword_lengths.json
```json
{
  "1": 173,
  "2": 1700,
  "3": 4965,
  "4": 8477,
  [..]
  "24": 1,
  "mean": 6.656015625,
  "vocab_size": 64000
}
```
- Dataset Statistics: `overview.json`

overview.json
```json
{
  "files": 1,
  "documents_total": 8625,
  "documents": 8625,
  "dataset_files": [..],
  "data_size_total": "0.0121G",
  "data_size": ["0.0032G", "0.0022G", "0.0014G", "0.0004G", "0.0000G", "0.0038G", "0.0011G"],
  "time": "7.95s"
}
```