Sampling
Note: If you want to skip this step, just point <data_train> to the <data_original> folder in the environment file and proceed with the training.
Often, especially with very large datasets, only a fraction of the original data is needed for tokenizer training and evaluation. Moreover, the sampled training and evaluation data should be disjoint. Finally, the data is sometimes weighted for tokenizer (and model) training, see e.g. GPT-3 or GPT-SW3.
How-To
To sample (and weight) data from the original files in <data_original>, take the following steps:
- Specify the categories, languages and their corresponding weights in SAMPLING_WEIGHTS.csv:
  SAMPLING_WEIGHTS.csv
      category,sv,en
      articles,1,0.5
      books,0.7,1
  The above example contains 2 categories (articles & books) and 2 languages (sv & en).
- Run script_sampling.py:
  sample data
      python script_sampling.py
          --percent <percent>    # e.g. 10
          [--evaluation 0]       # 0 = <data_train>, 1 = <data_eval>
Arguments:
- --percent is the fraction of documents with respect to the original data, in percent
Optional Arguments:
- --evaluation can be used to sample data for evaluation instead of training
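To make the layout of SAMPLING_WEIGHTS.csv concrete, here is a minimal Python sketch that parses the example table into one weight per (category, language) combination. It is an illustration only and not the code of script_sampling.py; the function name read_sampling_weights and the inline EXAMPLE string are hypothetical.

```python
import csv
import io


def read_sampling_weights(csv_text: str) -> dict[tuple[str, str], float]:
    """Parse the weights table into {(category, language): weight}.

    Hypothetical helper: the first column is expected to be named
    "category", every other column is treated as a language.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    weights = {}
    for row in reader:
        category = row.pop("category")
        for language, weight in row.items():
            weights[(category, language)] = float(weight)
    return weights


# Same content as the SAMPLING_WEIGHTS.csv example above.
EXAMPLE = """category,sv,en
articles,1,0.5
books,0.7,1
"""

weights = read_sampling_weights(EXAMPLE)
for (category, language), weight in sorted(weights.items()):
    print(f"{category}_{language}: {weight}")
```

Each key corresponds to one original file <category>_<language>.jsonl, so the example yields four weights.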
Note that
- for each combination $x = cl$ of a category $c$ and language $l$, the fraction of sampled documents is given by the product of the individual weight $W_x$ read from SAMPLING_WEIGHTS.csv and the global factor $p$ specified via --percent:
  $$ \left(\frac{\text{size of sampled data}}{\text{size of original data}}\right)_x \approx \left(\frac{\text{number of sampled documents}}{\text{number of original documents}}\right)_x = W_x \cdot p $$
- when evaluation data is sampled from the original data, the previously sampled training data is excluded in order to ensure disjoint samples
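The two points above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation in script_sampling.py: documents are represented by integer ids, $p$ is taken as a percentage and therefore divided by 100, and the names sampled_fraction and sample_disjoint are hypothetical.

```python
import random


def sampled_fraction(weight: float, percent: float) -> float:
    """Fraction of documents to keep, W_x * p, with p given in percent."""
    return weight * percent / 100.0


def sample_disjoint(doc_ids, weight, percent, exclude=(), seed=0):
    """Sample about W_x * p of the documents, never picking ids in `exclude`.

    Assumption of this sketch: the target count refers to the full original
    data, and it is capped at the number of remaining candidates
    (no upsampling).
    """
    doc_ids = list(doc_ids)
    target = round(sampled_fraction(weight, percent) * len(doc_ids))
    excluded = set(exclude)
    candidates = [d for d in doc_ids if d not in excluded]
    rng = random.Random(seed)
    return rng.sample(candidates, min(target, len(candidates)))


# 1000 original documents for one combination x, W_x = 0.5, --percent 10
docs = range(1000)
train = sample_disjoint(docs, weight=0.5, percent=10)
# Evaluation sampling excludes the previously sampled training documents.
evaluation = sample_disjoint(docs, weight=0.5, percent=10, exclude=train, seed=1)

print(len(train), len(evaluation), set(train).isdisjoint(evaluation))
```

With $W_x = 0.5$ and --percent 10, about 5% of the documents of that combination end up in each sample, and excluding the training ids before drawing the evaluation sample guarantees that the two do not overlap.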
Results
The sampled (and weighted) data files are called <category>_<language>.jsonl (just like their original counterparts, see preparation) and can be found in the folder
- <data_train> if --evaluation 0 is used
- <data_eval> if --evaluation 1 is used
In addition, the folder contains a log file SAMPLING.log with information about the sampling process.
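As a quick sanity check on the results, the realized sampling fraction per file can be compared with the expected $W_x \cdot p$. The following Python sketch assumes that each .jsonl file holds one document per non-empty line; the helper names count_documents and realized_fractions are hypothetical, and the folder paths are passed in by the caller.

```python
from pathlib import Path


def count_documents(folder: str) -> dict[str, int]:
    """Return {"<category>_<language>": document count} for each .jsonl file.

    Assumption of this sketch: one document per non-empty line.
    """
    counts = {}
    for path in sorted(Path(folder).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            counts[path.stem] = sum(1 for line in handle if line.strip())
    return counts


def realized_fractions(original_folder: str, sampled_folder: str) -> dict[str, float]:
    """Ratio of sampled to original documents for each original file."""
    original = count_documents(original_folder)
    sampled = count_documents(sampled_folder)
    return {
        name: sampled.get(name, 0) / total
        for name, total in original.items()
        if total > 0
    }
```

Calling realized_fractions with the paths configured for <data_original> and <data_train> (or <data_eval>) returns one ratio per <category>_<language> file, which should be close to the corresponding weight multiplied by the chosen percentage.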
In the next steps, the data are used for training and evaluation, respectively.