llm.trainers.gpt.data
get_datasets
get_datasets(
    *,
    dataset_name: str | None = None,
    dataset_config_name: str | None = None,
    validation_split_percentage: float = 0,
    train_file: str | None = None,
    validation_file: str | None = None,
    keep_linebreaks: bool = True
) -> Dataset | DatasetDict
Get the datasets.
You can either provide your own CSV/JSON/TXT training and evaluation files (see below) or the name of one of the public datasets available on the Hub at https://huggingface.co/datasets/, in which case the dataset is downloaded automatically.
For CSV/JSON files, the column named 'text' is used, or the first column if no 'text' column is found. You can easily tweak this behavior.
In distributed training, the load_dataset function guarantees that only one local process downloads the dataset at a time.
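A minimal usage sketch, assuming the 'wikitext' dataset from the Hub and hypothetical local file paths; none of these values are defaults of this module:

from llm.trainers.gpt.data import get_datasets

# Option 1: load a public dataset from the Hugging Face Hub (names are
# illustrative), holding out 5% of the training split for validation.
raw_datasets = get_datasets(
    dataset_name='wikitext',
    dataset_config_name='wikitext-103-raw-v1',
    validation_split_percentage=5,
)

# Option 2: load local plain-text files instead of a Hub dataset.
raw_datasets = get_datasets(
    train_file='data/train.txt',
    validation_file='data/valid.txt',
    keep_linebreaks=True,
)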
preprocess_datasets
preprocess_datasets(
    *,
    raw_datasets: DatasetT,
    tokenizer: AutoTokenizer,
    accelerator: Accelerator,
    num_workers: int | None = None,
    overwrite_cache: bool = False,
    block_size: int | None = None
) -> DatasetT
Preprocess the datasets: tokenize the raw text and group it into blocks of block_size tokens.
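A sketch of how get_datasets and preprocess_datasets might compose in a training script, assuming a GPT-2 tokenizer; the dataset name, worker count, and block size are illustrative choices, not defaults of this module:

from accelerate import Accelerator
from transformers import AutoTokenizer

from llm.trainers.gpt.data import get_datasets, preprocess_datasets

accelerator = Accelerator()
tokenizer = AutoTokenizer.from_pretrained('gpt2')

raw_datasets = get_datasets(
    dataset_name='wikitext',
    dataset_config_name='wikitext-103-raw-v1',
    validation_split_percentage=5,
)

# Tokenize the raw text and chunk it into fixed-length blocks; the
# num_workers and block_size values here are illustrative, not defaults.
lm_datasets = preprocess_datasets(
    raw_datasets=raw_datasets,
    tokenizer=tokenizer,
    accelerator=accelerator,
    num_workers=4,
    block_size=1024,
)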
group_texts
Concatenates texts from the dataset and generates chunks of block_size tokens.
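The grouping step follows the standard causal-language-modeling recipe. Below is a minimal sketch of that technique, assuming tokenized examples with fields such as input_ids; the exact signature and labels handling in llm/trainers/gpt/data.py may differ:

from itertools import chain

def group_texts(examples: dict[str, list], block_size: int = 1024) -> dict[str, list]:
    # Concatenate every field (input_ids, attention_mask, ...) into one long list.
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total_length = len(concatenated[next(iter(examples))])
    # Drop the tail so every chunk is exactly block_size tokens long.
    total_length = (total_length // block_size) * block_size
    # Split each concatenated field into block_size-sized chunks.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # Causal LM training typically copies input_ids as labels (upstream
    # Hugging Face example behavior; may not match this module exactly).
    result['labels'] = result['input_ids'].copy()
    return result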