Dataset
class to download, set up and inspect a single dataset
__init__(name, source, pretokenized=False, split=False, file_path=None, subset=None)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | `str` | name of dataset, e.g. "swedish_ner_corpus" | required |
| source | `str` | source of dataset, e.g. "HF", "BI", "LF" | required |
| pretokenized | `bool` | [only for source = "LF"] whether the dataset is pretokenized. otherwise, it has the standard type. | `False` |
| split | `bool` | [only for source = "LF"] whether the dataset is split into train/val/test subsets. otherwise, it is a single file. | `False` |
| file_path | `Optional[str]` | [only for source = "LF"] absolute file_path | `None` |
| subset | `Optional[str]` | [only for source = "HF"] name of subset if applicable, e.g. "simple_cased" | `None` |
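The source-specific parameters are easy to mix up. The following sketch is not part of the library: it uses a hypothetical helper, `build_dataset_kwargs`, that assembles keyword arguments matching the `__init__` signature above and rejects the combinations the table marks as inapplicable. Whether the library itself raises an error in these cases is not stated here, so the checks are an assumption of this sketch, and the file path is a placeholder.

```python
from typing import Any, Dict, Optional


def build_dataset_kwargs(
    name: str,
    source: str,
    pretokenized: bool = False,
    split: bool = False,
    file_path: Optional[str] = None,
    subset: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper (not a library function): collect and sanity-check arguments."""
    # pretokenized, split and file_path are documented as applying to source = "LF" only
    if source != "LF" and (pretokenized or split or file_path is not None):
        raise ValueError("pretokenized, split and file_path only apply to source='LF'")
    # subset is documented as applying to source = "HF" only
    if source != "HF" and subset is not None:
        raise ValueError("subset only applies to source='HF'")
    return {
        "name": name,
        "source": source,
        "pretokenized": pretokenized,
        "split": split,
        "file_path": file_path,
        "subset": subset,
    }


# A dataset identified by name and source only
hf_kwargs = build_dataset_kwargs("swedish_ner_corpus", "HF")

# A dataset read from a single pretokenized file (placeholder path)
lf_kwargs = build_dataset_kwargs(
    "my_dataset", "LF", pretokenized=True, file_path="/path/to/my_dataset"
)
```

The resulting dictionaries could then be unpacked into the constructor, for example `Dataset(**hf_kwargs)`.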
set_up(val_fraction=None, test_fraction=None)
sets up the dataset and creates the following files (if needed):
<STORE_DIR>/datasets/<name>/train.*
<STORE_DIR>/datasets/<name>/val.*
<STORE_DIR>/datasets/<name>/test.*
where * = jsonl or * = csv, depending on whether the data is pretokenized or not.
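To check which of these files exist after a set-up, the paths can be built by hand. This is a minimal sketch with a hypothetical helper, `expected_dataset_files`; the store directory `/tmp/store` is a made-up value, and the extension is passed explicitly rather than derived from the pretokenized setting.

```python
from pathlib import Path
from typing import List


def expected_dataset_files(store_dir: str, name: str, extension: str) -> List[Path]:
    """Hypothetical helper: build <STORE_DIR>/datasets/<name>/{train,val,test}.<extension>."""
    if extension not in ("jsonl", "csv"):
        raise ValueError("extension must be 'jsonl' or 'csv'")
    base = Path(store_dir) / "datasets" / name
    return [base / f"{phase}.{extension}" for phase in ("train", "val", "test")]


# "/tmp/store" stands in for the actual STORE_DIR
files = expected_dataset_files("/tmp/store", "swedish_ner_corpus", "jsonl")

# Files that have not been created (yet)
missing = [path for path in files if not path.exists()]
```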
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| val_fraction | `Optional[float]` | e.g. 0.1 (applicable if source = HF, BI, LF) | `None` |
| test_fraction | `Optional[float]` | e.g. 0.1 (applicable if source = LF) | `None` |
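To illustrate what a fraction-based split means, the sketch below divides a sequence of examples into train/val/test parts. It is not the library's implementation: `split_by_fraction`, the fixed seed, the shuffling, the rounding, and the treatment of `None` as "no examples for that subset" are all assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_by_fraction(
    rows: Sequence[T],
    val_fraction: Optional[float] = None,
    test_fraction: Optional[float] = None,
    seed: int = 42,
) -> Tuple[List[T], List[T], List[T]]:
    """Hypothetical helper: shuffle rows and cut them into train/val/test lists."""
    val_fraction = val_fraction or 0.0    # assumption: None means an empty val part
    test_fraction = test_fraction or 0.0  # assumption: None means an empty test part
    if not 0.0 <= val_fraction + test_fraction < 1.0:
        raise ValueError("val_fraction + test_fraction must be in [0, 1)")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    n_val = round(len(shuffled) * val_fraction)
    n_test = round(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


train, val, test = split_by_fraction(range(100), val_fraction=0.1, test_fraction=0.1)
```

With 100 examples and both fractions set to 0.1, this sketch produces 80 training, 10 validation and 10 test examples.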