Preparation

Environment

The environment file env.ini can be used to specify paths and settings. By default, it looks like this:

env.ini

[main]
data_original = data_original
data_train = data_train
data_eval = data_eval
output = output

[sampling]
weights = SAMPLING_WEIGHTS.csv

[other]
debug = 0
verbose = 0

The [main] section specifies the following folders:

<data_original>: contains original text data
<data_train>: contains sampled text data for training
<data_eval>: contains sampled text data for evaluation
<output>: contains trained tokenizers (incl. vocabulary, merge rules, parameters)

The [sampling] section contains the path to the sampling weights file (see Sampling). Each parameter in the [other] section must be 0 or 1; they toggle debugging and verbose output, respectively.
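
As an illustration of how these settings could be read, here is a minimal sketch using Python's standard `configparser` module. The helper name `load_env` and the returned dictionary layout are assumptions made for this example and are not part of the project; the sketch also assumes that 1 means "enabled" for the [other] flags.

```python
import configparser

# Default env.ini contents, copied from the listing above.
DEFAULT_ENV = """
[main]
data_original = data_original
data_train = data_train
data_eval = data_eval
output = output

[sampling]
weights = SAMPLING_WEIGHTS.csv

[other]
debug = 0
verbose = 0
"""


def load_env(text: str) -> dict:
    """Parse env.ini content into a flat dictionary (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        # Folder settings from the [main] section.
        "data_original": parser.get("main", "data_original"),
        "data_train": parser.get("main", "data_train"),
        "data_eval": parser.get("main", "data_eval"),
        "output": parser.get("main", "output"),
        # Path to the sampling weights file.
        "weights": parser.get("sampling", "weights"),
        # getboolean() maps "0" to False and "1" to True (assumed meaning).
        "debug": parser.getboolean("other", "debug"),
        "verbose": parser.getboolean("other", "verbose"),
    }


env = load_env(DEFAULT_ENV)
print(env)
```

To read an actual file instead of a string, `parser.read("env.ini")` can be used in place of `read_string`.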

Data Format

The files contained in the folder <data_original> must adhere to the following naming convention:
```
<category>_<language>.jsonl
```
Splitting the data into multiple categories (e.g. books, articles, ...) and/or languages (e.g. sv, da, ...) in this way allows each subset to be weighted individually (as described in Sampling).
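
The naming convention can be checked with a few lines of Python. This is only a sketch: the function name `parse_data_filename` is hypothetical, and it assumes that the language is the part after the last underscore, so that a category may itself contain underscores.

```python
from pathlib import Path
from typing import Tuple


def parse_data_filename(filename: str) -> Tuple[str, str]:
    """Split '<category>_<language>.jsonl' into (category, language)."""
    path = Path(filename)
    if path.suffix != ".jsonl":
        raise ValueError(f"expected a .jsonl file, got: {filename}")
    # Assumption: the language follows the LAST underscore in the file stem.
    category, separator, language = path.stem.rpartition("_")
    if not separator or not category or not language:
        raise ValueError(f"expected <category>_<language>.jsonl, got: {filename}")
    return category, language


print(parse_data_filename("books_sv.jsonl"))
print(parse_data_filename("articles_da.jsonl"))
```
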
Each line in a file must be a JSON object of the following form:
```
{"text": "..."} 
```
Fields other than "text" may be present but will be ignored.
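
The following sketch shows one way to read such rows with Python's standard `json` module, keeping only the "text" field. The helper `iter_texts` is hypothetical, and skipping blank lines and raising an error when "text" is missing are assumptions of this example rather than documented behavior.

```python
import json
from typing import Iterable, Iterator


def iter_texts(lines: Iterable[str]) -> Iterator[str]:
    """Yield the "text" field of every JSONL row, ignoring other fields."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # Assumption: blank lines are skipped.
        record = json.loads(line)
        if not isinstance(record, dict) or "text" not in record:
            raise ValueError(f"line {line_number}: missing 'text' field")
        yield record["text"]


# Two example rows; the second carries an extra field that is ignored.
sample_rows = [
    '{"text": "first document"}',
    '{"text": "second document", "source": "example"}',
]
texts = list(iter_texts(sample_rows))
print(texts)
```

Since an open file object iterates over its lines, the same function can be applied to a file, e.g. `with open(path, encoding="utf-8") as f: texts = list(iter_texts(f))`.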