Evaluation


How-To

To evaluate the tokenizer on data in the <data_eval> folder, run script_evaluate.py:

evaluate tokenizer
python script_evaluate.py 
  --tokenizer_name <tokenizer_name>         # e.g. tokenizer1 
  [--vocab_size <vocab_size>]               # e.g. 64000
  [--monolingual]                           # if used, monolingual tokenizers are evaluated
  [--vocab_size_pruned <vocab_size_pruned>] # e.g. 40000 51200 (only SP)

Arguments:

  • --tokenizer_name is the name of the tokenizer, e.g. tokenizer1


Optional Arguments:

  • --vocab_size specifies the vocabulary size of the tokenizer (useful if there are several tokenizers with the same tokenizer_name but different vocabulary sizes)
  • --monolingual evaluates the monolingual tokenizers (possibly several, trained with the same flag)


Optional Arguments only available for SentencePiece:

  • --vocab_size_pruned additionally evaluates the tokenizer with the given pruned vocabulary sizes
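
For example, to evaluate a SentencePiece tokenizer named tokenizer1 with a 64000-token vocabulary and two pruned vocabulary sizes, the call could look like this (the argument values are illustrative):

python script_evaluate.py \
  --tokenizer_name tokenizer1 \
  --vocab_size 64000 \
  --vocab_size_pruned 40000 51200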


The script

  • applies the tokenizer <output>/*_<tokenizer_name> to each dataset <dataset_eval> in <data_eval>
  • computes the following evaluation metrics (see the paper for exact definitions; a rough sketch of two of them follows the list):

    • unknown_rate
    • fertility
    • proportion_of_continued_words
    • token_frequencies
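
As a rough illustration only (not the script's actual implementation), fertility and the proportion of continued words can be computed from a whitespace-pretokenized corpus, assuming a tokenizer object that exposes an encode() method returning a list of tokens:

# Minimal sketch under the assumptions above; refer to the paper for the exact definitions.

def fertility(words, tokenizer):
    """Average number of tokens the tokenizer produces per word."""
    token_counts = [len(tokenizer.encode(word)) for word in words]
    return sum(token_counts) / len(token_counts)

def proportion_of_continued_words(words, tokenizer):
    """Share of words that are split into more than one token."""
    continued = sum(1 for word in words if len(tokenizer.encode(word)) > 1)
    return continued / len(words)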

Results

Results are written to

  • <output>/evaluation/results_<tokenizer_name>.json
  • <output>/evaluation/token_frequencies_<tokenizer_name>.json

They will be used for the analysis.
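
A minimal sketch of loading these files for a follow-up analysis (the directory layout and the internal structure of the JSON files below are assumptions, not the actual schema):

import json
from pathlib import Path

output_dir = Path("output/evaluation")  # substitute your <output> directory
tokenizer_name = "tokenizer1"           # example tokenizer name from above

# results_<tokenizer_name>.json: evaluation metrics per dataset (assumed structure)
with open(output_dir / f"results_{tokenizer_name}.json") as f:
    results = json.load(f)

# token_frequencies_<tokenizer_name>.json: token frequency counts (assumed structure)
with open(output_dir / f"token_frequencies_{tokenizer_name}.json") as f:
    token_frequencies = json.load(f)

print(json.dumps(results, indent=2)[:500])  # quick look at the contents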