gpt-sw3-tokenizer - Train, evaluate and analyze BPE tokenizers.



Installation

git clone https://github.com/flxst/gpt-sw3-tokenizer.git
cd gpt-sw3-tokenizer
pip install -r requirements.txt

About

This repository provides easy-to-use tools to sample (weighted) data and subsequently train, evaluate and analyze a BPE tokenizer.

Pipeline: Sampling → Training → Evaluation → Analysis

Features

Sampling

  • customizable amounts of sampled data for training and evaluation (the two sets are disjoint)
  • weighting of different categories and languages
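The repository's own sampling configuration is not shown here; the idea of weighting categories and languages can be sketched in pure Python. The corpora, weights and function name below are hypothetical, chosen only for illustration:

```python
import random

# Hypothetical corpora keyed by (category, language); the weights are
# illustrative assumptions, not the repository's actual configuration.
corpora = {
    ("books", "sv"): ["svensk mening 1", "svensk mening 2", "svensk mening 3"],
    ("web", "en"): ["english sentence 1", "english sentence 2"],
}
weights = {("books", "sv"): 0.7, ("web", "en"): 0.3}


def sample_documents(n, seed=0):
    """Draw n documents, picking a corpus per draw according to its weight."""
    rng = random.Random(seed)
    keys = list(corpora)
    probs = [weights[k] for k in keys]
    docs = []
    for _ in range(n):
        key = rng.choices(keys, weights=probs, k=1)[0]
        docs.append(rng.choice(corpora[key]))
    return docs


train = sample_documents(8)
```

With a fixed seed the draw is reproducible; in practice one would also split the sampled data into disjoint training and evaluation sets.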

Training

  • support for SentencePiece and HuggingFace
  • customizable tokenizer features (vocabulary size, handling of whitespace and numbers, etc.)
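In practice the training is delegated to SentencePiece or HuggingFace, but the core BPE procedure they implement can be sketched in a few lines: repeatedly merge the most frequent adjacent symbol pair. This is a minimal illustration of the algorithm, not the repository's training code:

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn BPE merges by repeatedly merging the most frequent symbol pair."""
    # Each word starts as a tuple of characters; vocab maps word -> frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        merged = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
    return merges


merges = train_bpe(["low", "lower", "lowest", "low"], num_merges=3)
# First merge is ('l', 'o'): the pair occurs in all four words.
```

The vocabulary size feature mentioned above corresponds directly to the number of merges: each merge adds one symbol to the vocabulary.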

Evaluation

  • computation of common tokenizer metrics (unknown rate, fertility, proportion of continued words, etc.)
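These metrics have simple definitions: fertility is the average number of tokens per word, the unknown rate is the fraction of tokens that are the unknown token, and the proportion of continued words is the fraction of words split into more than one token. A minimal sketch (the function name and the `<unk>` token string are assumptions, not the repository's API):

```python
def tokenizer_metrics(tokenized_words, unk_token="<unk>"):
    """Compute common tokenizer metrics from per-word tokenizations.

    tokenized_words: one list of tokens per whitespace-separated word.
    """
    n_words = len(tokenized_words)
    n_tokens = sum(len(toks) for toks in tokenized_words)
    n_unk = sum(t == unk_token for toks in tokenized_words for t in toks)
    return {
        "fertility": n_tokens / n_words,  # average tokens per word
        "unknown_rate": n_unk / n_tokens,  # fraction of unknown tokens
        "proportion_continued_words": sum(len(toks) > 1 for toks in tokenized_words) / n_words,
    }


m = tokenizer_metrics([["low"], ["low", "er"], ["<unk>"]])
# 4 tokens over 3 words -> fertility 4/3; 1 unknown of 4 tokens -> 1/4.
```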

Analysis

  • example tokenization
  • vocabulary overlap and performance comparison across languages
  • effect of the vocabulary size
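Vocabulary overlap across languages can be quantified by comparing the vocabularies of tokenizers trained on different data, for instance as the Jaccard index of the two sets. A minimal sketch with toy vocabularies (the function name is an assumption):

```python
def vocab_overlap(vocab_a, vocab_b):
    """Jaccard overlap between two tokenizer vocabularies."""
    a, b = set(vocab_a), set(vocab_b)
    return len(a & b) / len(a | b)


# Toy English and Swedish vocabularies sharing two subword units.
overlap = vocab_overlap({"the", "ing", "lo", "w"}, {"det", "ing", "lo", "en"})
# 2 shared entries out of 6 distinct -> 1/3.
```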

Citation

@misc{gpt-sw3-tokenizer,
  title = {Training and Evaluation of a Multilingual Tokenizer for {GPT}-{SW3}},
  url = {http://arxiv.org/abs/2304.14780},
  author = {Stollenwerk, Felix},
  year = {2023},
}