Preparation
Environment
The environment file env.ini can be used to specify paths and settings.
By default, it looks like this:
env.ini
[main]
data_original = data_original
data_train = data_train
data_eval = data_eval
output = output
[sampling]
weights = SAMPLING_WEIGHTS.csv
[other]
debug = 0
verbose = 0
In the [main] section, the following folders are given:
- <data_original>: contains original text data
- <data_train>: contains sampled text data for training
- <data_eval>: contains sampled text data for evaluation
- <output>: contains trained tokenizers (incl. vocabulary, merge rules, parameters)
The [sampling] section contains the path to the sampling weights file (see Sampling).
The parameters in the [other] section should be 0 or 1 and can be used to enable debugging or verbose output.
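The settings described above can be read with Python's standard configparser module. The following is a minimal sketch, not the project's own loader: the function name load_env and the returned dictionary layout are hypothetical, and the default file contents are embedded as a string so that the example runs on its own.

```python
import configparser

# Default contents of env.ini, copied from the listing above.
DEFAULT_ENV = """\
[main]
data_original = data_original
data_train = data_train
data_eval = data_eval
output = output
[sampling]
weights = SAMPLING_WEIGHTS.csv
[other]
debug = 0
verbose = 0
"""


def load_env(text):
    """Parse env.ini contents (hypothetical helper, for illustration only)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)

    # Folder names from the [main] section.
    folders = dict(parser["main"])

    # Path to the sampling weights file from the [sampling] section.
    weights = parser["sampling"]["weights"]

    # Flags in [other] are expected to be 0 or 1; anything else is rejected.
    flags = {}
    for key in ("debug", "verbose"):
        value = parser["other"].get(key, "0")
        if value not in ("0", "1"):
            raise ValueError(f"[other] {key} must be 0 or 1, got {value!r}")
        flags[key] = value == "1"

    return {"folders": folders, "weights": weights, "flags": flags}


env = load_env(DEFAULT_ENV)
print(env["folders"]["data_original"])
print(env["weights"])
print(env["flags"])
```

To read an actual file instead of the embedded string, parser.read("env.ini") can be used in place of parser.read_string(text).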
Data Format
- The files contained in the folder <data_original> must adhere to the following naming convention:
  <category>_<language>.jsonl
  Data split into multiple categories (e.g. books, articles, ...) and/or languages (e.g. sv, da, ...) like this can be weighted in a customized way (as described in Sampling).
- Each row in a file needs to be formatted like this:
  {"text": "..."}
  Fields other than "text" may be present but will be ignored.
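The naming convention and row format above can be checked before sampling. The sketch below is only an illustration under stated assumptions: the helper names parse_filename and extract_texts are hypothetical, the pattern assumes that neither <category> nor <language> contains an underscore, and blank lines are skipped, none of which is specified by the format description itself.

```python
import json
import re

# Assumption: category and language contain no underscores,
# so the single "_" separates them unambiguously.
FILENAME_PATTERN = re.compile(r"^(?P<category>[^_]+)_(?P<language>[^_.]+)\.jsonl$")


def parse_filename(name):
    """Split '<category>_<language>.jsonl' into (category, language)."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not match <category>_<language>.jsonl")
    return match["category"], match["language"]


def extract_texts(lines):
    """Return the "text" field of each JSON row; other fields are ignored."""
    texts = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are tolerated and skipped
        record = json.loads(line)
        if "text" not in record:
            raise ValueError(f'row {number} has no "text" field')
        texts.append(record["text"])
    return texts


category, language = parse_filename("books_sv.jsonl")
rows = ['{"text": "hej", "id": 1}', '{"text": "god dag"}']
texts = extract_texts(rows)
print(category, language, texts)
```

For a real file, the open file object can be passed directly, e.g. extract_texts(open(path, encoding="utf-8")), since iterating over it yields one row per line.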