Dataset

class to download, set up and inspect a single dataset

__init__(name, source, pretokenized=False, split=False, file_path=None, subset=None)

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | name of dataset, e.g. "swedish_ner_corpus" | required |
| source | str | source of dataset, e.g. "HF", "BI", "LF" | required |
| pretokenized | bool | [only for source = "LF"] whether the dataset is pretokenized. otherwise, it has the standard type. | False |
| split | bool | [only for source = "LF"] whether the dataset is split into train/val/test subsets. otherwise, it is a single file. | False |
| file_path | Optional[str] | [only for source = "LF"] absolute file_path | None |
| subset | Optional[str] | [only for source = "HF"] name of subset if applicable, e.g. "simple_cased" | None |
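
The source-specific restrictions in the parameter table can be checked before a Dataset is constructed. The following is a minimal sketch only: `build_dataset_kwargs` is a hypothetical helper, not part of the documented class, and whether the class itself rejects such combinations is not stated here.

```python
from typing import Any, Dict, Optional


def build_dataset_kwargs(
    name: str,
    source: str,
    pretokenized: bool = False,
    split: bool = False,
    file_path: Optional[str] = None,
    subset: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect the __init__ arguments and reject options
    that the parameter table marks as belonging to a different source."""
    # pretokenized, split and file_path are documented as LF-only
    if source != "LF" and (pretokenized or split or file_path is not None):
        raise ValueError("pretokenized, split and file_path only apply to source = 'LF'")
    # subset is documented as HF-only
    if source != "HF" and subset is not None:
        raise ValueError("subset only applies to source = 'HF'")
    return {
        "name": name,
        "source": source,
        "pretokenized": pretokenized,
        "split": split,
        "file_path": file_path,
        "subset": subset,
    }


# Keyword arguments for the example dataset name used in the table above.
hf_kwargs = build_dataset_kwargs("swedish_ner_corpus", source="HF")
```

The resulting dictionary could then be passed on as `Dataset(**hf_kwargs)`; the import path for `Dataset` is not shown on this page and is left out here.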

set_up(val_fraction=None, test_fraction=None)

sets up the dataset and creates the following files (if needed):

  • <STORE_DIR>/datasets/<name>/train.*
  • <STORE_DIR>/datasets/<name>/val.*
  • <STORE_DIR>/datasets/<name>/test.*

where * = jsonl if the data is pretokenized and * = csv otherwise.
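
To make the file layout concrete, here is a small sketch that builds the three expected paths. `expected_files` is a hypothetical helper and the store directory is passed in explicitly; it assumes that pretokenized data maps to jsonl and standard data to csv.

```python
from pathlib import PurePosixPath
from typing import List


def expected_files(store_dir: str, name: str, pretokenized: bool) -> List[str]:
    """Hypothetical helper: list the files set_up() is documented to create."""
    # Assumption: pretokenized -> jsonl, standard -> csv
    extension = "jsonl" if pretokenized else "csv"
    base = PurePosixPath(store_dir) / "datasets" / name
    return [str(base / f"{phase}.{extension}") for phase in ("train", "val", "test")]


paths = expected_files("/store", "swedish_ner_corpus", pretokenized=False)
```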

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| val_fraction | Optional[float] | e.g. 0.1 (applicable if source = HF, BI, LF) | None |
| test_fraction | Optional[float] | e.g. 0.1 (applicable if source = LF) | None |
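
How the fractions translate into subsets is not spelled out here. One plausible reading, sketched below under stated assumptions, is that each fraction gives the share of rows moved out of the training data. `split_by_fraction` is a hypothetical function; the actual implementation may shuffle, round differently, or handle already-split datasets in another way.

```python
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_by_fraction(
    rows: Sequence[T],
    val_fraction: Optional[float] = None,
    test_fraction: Optional[float] = None,
) -> Tuple[List[T], List[T], List[T]]:
    """Hypothetical sketch: split rows into train/val/test without shuffling."""
    val_share = val_fraction or 0.0    # None is treated as "no validation split"
    test_share = test_fraction or 0.0  # None is treated as "no test split"
    if val_share < 0 or test_share < 0 or val_share + test_share >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    n_val = int(len(rows) * val_share)    # rounded down
    n_test = int(len(rows) * test_share)  # rounded down
    n_train = len(rows) - n_val - n_test
    train = list(rows[:n_train])
    val = list(rows[n_train:n_train + n_val])
    test = list(rows[n_train + n_val:])
    return train, val, test


train, val, test = split_by_fraction(list(range(10)), val_fraction=0.2, test_fraction=0.1)
```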