## Evaluation

### How-To
To evaluate the tokenizer on data in the `<data_eval>` folder, run `script_evaluate.py`:

```
# evaluate tokenizer
python script_evaluate.py
    --tokenizer_name <tokenizer_name>          # e.g. tokenizer1
    [--vocab_size <vocab_size>]                # e.g. 64000
    [--monolingual]                            # if used, monolingual models are evaluated
    [--vocab_size_pruned <vocab_size_pruned>]  # e.g. 40000 51200 (only SP)
```
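
For example, `python script_evaluate.py --tokenizer_name tokenizer1 --vocab_size 64000` evaluates the tokenizer named `tokenizer1` with a vocabulary size of 64000 (the name and size here are just the example values from above).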
Arguments:
- `--tokenizer_name` is the name of the tokenizer, e.g. `tokenizer1`

Optional Arguments:
- `--vocab_size` specifies the vocabulary size of the tokenizer (useful if there are several tokenizers with the same `tokenizer_name` but different vocabulary sizes)
- `--monolingual` evaluates (multiple) monolingual tokenizers (trained with the same flag)
Optional Arguments only available for SentencePiece:
- `--vocab_size_pruned` evaluates the tokenizer with different, pruned vocabulary sizes
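
For example, appending `--vocab_size_pruned 40000 51200` to the command above would additionally evaluate the same SentencePiece tokenizer with its vocabulary pruned to 40000 and to 51200 tokens.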
The script
- applies the tokenizer `<output>/*_<tokenizer_name>` on each dataset `<dataset_eval>` in `<data_eval>`
- computes the following evaluation metrics (see the paper for details; a rough sketch is given after this list):
  - `unknown_rate`
  - `fertility`
  - `proportion_of_continued_words`
  - `token_frequencies`
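
The sketch below is only meant to give an intuition for the first three metrics, using their common definitions (unknown tokens per token, tokens per word, and the share of words split into more than one token); the exact definitions are the ones in the paper, and the `tokenize` callable and `UNK` symbol are placeholders, not part of this repository.

```python
# Illustrative sketch only (not the actual implementation of script_evaluate.py):
# common definitions of unknown_rate, fertility and proportion_of_continued_words.
from typing import Callable, Dict, List

UNK = "<unk>"  # placeholder unknown-token symbol; depends on the tokenizer


def compute_metrics(words: List[str], tokenize: Callable[[str], List[str]]) -> Dict[str, float]:
    """Compute per-word metrics for a whitespace-pretokenized corpus."""
    num_words = len(words)
    num_tokens = 0
    num_unk_tokens = 0
    num_continued_words = 0  # words split into more than one token

    for word in words:
        tokens = tokenize(word)
        num_tokens += len(tokens)
        num_unk_tokens += sum(1 for token in tokens if token == UNK)
        if len(tokens) > 1:
            num_continued_words += 1

    return {
        "unknown_rate": num_unk_tokens / num_tokens,
        "fertility": num_tokens / num_words,
        "proportion_of_continued_words": num_continued_words / num_words,
    }
```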
### Results

Results are written to
- `<output>/evaluation/results_<tokenizer_name>.json`
- `<output>/evaluation/token_frequencies_<tokenizer_name>.json`

They will be used for the analysis.
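
As a minimal sketch of how these files can be read back for that analysis, assuming `<output>` is `output` and the tokenizer is called `tokenizer1` (both placeholders):

```python
import json

# Paths assume <output> = "output" and <tokenizer_name> = "tokenizer1".
with open("output/evaluation/results_tokenizer1.json", encoding="utf-8") as f:
    results = json.load(f)

with open("output/evaluation/token_frequencies_tokenizer1.json", encoding="utf-8") as f:
    token_frequencies = json.load(f)

print(results)  # the evaluation metrics (exact structure is defined by script_evaluate.py)
```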