Sampling

Note: To skip this step, point <data_train> to the <data_original> folder in the environment file and proceed with the training.

Often, especially with very large datasets, one wants to use only a fraction of the original data for tokenizer training and evaluation. Moreover, the sampled training and evaluation data should be disjoint. Finally, the data is sometimes weighted for tokenizer (and model) training; see e.g. GPT-3 or GPT-SW3.

How-To

To sample (and weight) data from the original files in <data_original>, take the following steps:

Specify the categories, languages and their corresponding weights in SAMPLING_WEIGHTS.csv:
SAMPLING_WEIGHTS.csv
```
category,sv,en
articles,1,0.5
books,0.7,1
```
The above example contains 2 categories (articles & books) and 2 languages (sv & en).
Run script_sampling.py:
sample data
```
python script_sampling.py 
    --percent <percent>   # e.g. 10
    [--evaluation 0]      # 0 = <data_train>, 1 = <data_eval>
```
Arguments:
- --percent is the percentage of documents to sample, relative to the original data
Optional Arguments:
- --evaluation can be used to sample data for evaluation instead of training
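
To make the layout of the weights file from the first step concrete, here is a minimal sketch that reads it into a lookup table keyed by (category, language). The function name `read_sampling_weights` is hypothetical and is not taken from script_sampling.py; the sketch only assumes the CSV layout shown above.

```python
import csv
import io


def read_sampling_weights(csv_text):
    """Return {(category, language): weight} parsed from the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    weights = {}
    for row in reader:
        category = row["category"]
        for language, value in row.items():
            if language == "category":
                continue  # the first column holds the category name, not a weight
            weights[(category, language)] = float(value)
    return weights


# Same content as the SAMPLING_WEIGHTS.csv example above.
EXAMPLE = """category,sv,en
articles,1,0.5
books,0.7,1
"""

weights = read_sampling_weights(EXAMPLE)
```

Each of the 2 x 2 combinations gets its own weight; for instance, `weights[("articles", "en")]` holds the weight for English articles.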

Note that

- for each combination $x = cl$ of a category $c$ and language $l$, the fraction of sampled documents is given by the product of the individual weight $W_x$ read from SAMPLING_WEIGHTS.csv and the global factor $p$ specified via --percent:

$$ \left(\frac{\text{size of sampled data}}{\text{size of original data}}\right)_x \approx \left(\frac{\text{number of sampled documents}}{\text{number of original documents}}\right)_x = W_x \cdot p $$

- when evaluation data is sampled from the original data, the previously sampled training data is excluded to ensure that the two samples are disjoint
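
The two points above can be illustrated with a small sketch. It is not the logic of script_sampling.py: the helper names, the use of seeded random sampling, and the rounding are assumptions made for illustration. It reads $p$ as a percentage, so the fraction for a combination is $W_x \cdot p / 100$.

```python
import random


def sampled_fraction(weight, percent):
    """Fraction of documents to sample: W_x * p, with p given in percent."""
    return weight * percent / 100.0


def sample_indices(n_documents, fraction, excluded=frozenset(), seed=0):
    """Pick round(fraction * n_documents) document indices, skipping excluded ones."""
    candidates = [i for i in range(n_documents) if i not in excluded]
    k = min(len(candidates), round(fraction * n_documents))
    return set(random.Random(seed).sample(candidates, k))


# Hypothetical combination with weight 0.5, sampled with --percent 10.
n_docs = 1000
fraction = sampled_fraction(0.5, 10)

# Training sample first, then an evaluation sample that excludes it.
train = sample_indices(n_docs, fraction)
evaluation = sample_indices(n_docs, fraction, excluded=train, seed=1)
```

With a weight of 0.5 and --percent 10, about 5% of the documents of that combination are drawn for each sample, and excluding the training indices keeps the evaluation sample disjoint from it.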

Results

The sampled (and weighted) data files are called <category>_<language>.jsonl (just like their original counterparts, see preparation) and can be found in the folder

- <data_train> if --evaluation 0 is used
- <data_eval> if --evaluation 1 is used

In addition, the folder contains a log file, SAMPLING.log, with information about the sampling process.
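
As a quick sanity check on the output, the following sketch counts the documents per <category>_<language>.jsonl file in a folder. It assumes one JSON document per non-empty line and that the language code follows the last underscore in the file name. The helper `count_documents`, the demo folder, and the `"text"` key are hypothetical.

```python
import tempfile
from pathlib import Path


def count_documents(folder):
    """Map (category, language) to the number of non-empty lines in each .jsonl file."""
    counts = {}
    for path in sorted(Path(folder).glob("*.jsonl")):
        # Split on the last underscore so categories may contain underscores.
        category, language = path.stem.rsplit("_", 1)
        with path.open(encoding="utf-8") as handle:
            counts[(category, language)] = sum(1 for line in handle if line.strip())
    return counts


# Demo on a temporary folder standing in for <data_train> or <data_eval>.
with tempfile.TemporaryDirectory() as demo_folder:
    Path(demo_folder, "articles_sv.jsonl").write_text(
        '{"text": "a"}\n{"text": "b"}\n', encoding="utf-8"
    )
    Path(demo_folder, "books_en.jsonl").write_text('{"text": "c"}\n', encoding="utf-8")
    demo_counts = count_documents(demo_folder)
```

Comparing these counts with the corresponding counts for <data_original> gives a rough check that the sampled fractions match the configured weights.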

In the next steps, the data are used for training and evaluation, respectively.