llm.datasets.bert
NVIDIA BERT dataset provider.
Batch
¶
Bases: NamedTuple
BERT pretraining batch.
input_ids
instance-attribute
¶
Input sequence token IDs with shape (batch_size, seq_len).
attention_mask
instance-attribute
¶
Input sequence attention mask with shape (batch_size, seq_len).
Indicates which tokens in input_ids should be attended to (i.e., are not
padding tokens). Also known as the input mask.
token_type_ids
instance-attribute
¶
Token IDs indicating the segment labels, with shape (batch_size, seq_len).
E.g. if input_ids is composed of two distinct segments, the first segment
will have token IDs set to 0 and the second to 1. Also known as the
segment ids.
masked_labels
instance-attribute
¶
True token IDs of the masked tokens in input_ids, with shape (batch_size, seq_len).
Indices corresponding to non-masked tokens in input_ids are typically
set to -100 to avoid contributing to the MLM loss.
next_sentence_labels
instance-attribute
¶
Boolean tensor indicating the next sentence label, with shape (batch_size,).
A true (1) value indicates the next sentence/segment logically follows
the first. If there is only one segment in input_ids, this can be
set to None.
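The field layout above can be made concrete with a toy batch. This is a hypothetical mirror of the documented fields, not the library's actual class: the real Batch stores torch.Tensors, and nested Python lists stand in for them here.

```python
from typing import Any, NamedTuple

# Hypothetical stand-in for the documented Batch NamedTuple. The real
# class holds torch.Tensors; plain nested lists are used for illustration.
class Batch(NamedTuple):
    input_ids: Any             # (batch_size, seq_len)
    attention_mask: Any        # (batch_size, seq_len); 1 = attend, 0 = padding
    token_type_ids: Any        # (batch_size, seq_len); segment 0 vs. segment 1
    masked_labels: Any         # (batch_size, seq_len); -100 at unmasked positions
    next_sentence_labels: Any  # (batch_size,); 1 = second segment follows first

# A toy batch of two length-4 sequences. In the first sequence, position 1
# is masked, so masked_labels restores its true token ID (2023) there and
# holds -100 (ignored by the MLM loss) everywhere else.
batch = Batch(
    input_ids=[[101, 103, 2003, 102], [101, 7592, 103, 102]],
    attention_mask=[[1, 1, 1, 1], [1, 1, 1, 0]],
    token_type_ids=[[0, 0, 1, 1], [0, 0, 0, 0]],
    masked_labels=[[-100, 2023, -100, -100], [-100, -100, 2088, -100]],
    next_sentence_labels=[1, 0],
)
```

Note the padding convention in play: the second sequence's final attention_mask entry is 0, marking that slot as padding.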
Sample
¶
Bases: NamedTuple
BERT pretraining sample.
input_ids
instance-attribute
¶
Input sequence token IDs with shape (seq_len,).
attention_mask
instance-attribute
¶
Input sequence attention mask with shape (seq_len,).
Indicates which tokens in input_ids should be attended to (i.e., are not
padding tokens). Also known as the input mask.
token_type_ids
instance-attribute
¶
Token IDs indicating the segment labels, with shape (seq_len,).
E.g. if input_ids is composed of two distinct segments, the first segment
will have token IDs set to 0 and the second to 1. Also known as the
segment ids.
masked_labels
instance-attribute
¶
True token IDs of the masked tokens in input_ids, with shape (seq_len,).
Indices corresponding to non-masked tokens in input_ids are typically
set to -100 to avoid contributing to the MLM loss.
next_sentence_label
instance-attribute
¶
Boolean scalar tensor indicating the next sentence label.
A true (1) value indicates the next sentence/segment logically follows
the first. If there is only one segment in input_ids, this can be
set to None.
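A Sample holds one sequence; a Batch is conceptually a field-wise stack of Samples (note the singular next_sentence_label becoming the plural next_sentence_labels). A hypothetical collate sketch, with plain lists standing in for tensors:

```python
# Hypothetical sketch of collating Samples into a Batch: aggregate each
# field across samples, turning (seq_len,)-shaped fields into
# (batch_size, seq_len)-shaped ones. Not the library's actual collate_fn.
def collate(samples):
    input_ids, attention_mask, token_type_ids, masked_labels, nsp = zip(*samples)
    return (
        list(input_ids),       # (batch_size, seq_len)
        list(attention_mask),  # (batch_size, seq_len)
        list(token_type_ids),  # (batch_size, seq_len)
        list(masked_labels),   # (batch_size, seq_len)
        list(nsp),             # (batch_size,) -- next_sentence_labels
    )

# Two toy Samples as plain tuples in the documented field order.
samples = [
    ([101, 103, 102], [1, 1, 1], [0, 0, 0], [-100, 2023, -100], 1),
    ([101, 7592, 102], [1, 1, 1], [0, 0, 0], [-100, -100, -100], 0),
]
batch = collate(samples)
```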
NvidiaBertDataset
¶
NvidiaBertDataset(input_file: str)
NVIDIA BERT dataset.
Like the PyTorch Dataset, this dataset is indexable and returns a Sample.
Parameters:
- input_file (str) – HDF5 file to load.
Source code in llm/datasets/bert.py
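The indexing behavior can be sketched with a toy map-style dataset. This is a hypothetical stand-in, assuming the shard stores parallel per-field arrays; plain dicts and lists replace the real h5py file, and the key names here are illustrative, not the actual HDF5 dataset names.

```python
# Hypothetical sketch of a map-style dataset over preloaded parallel
# arrays, mimicking how NvidiaBertDataset is indexed to yield a Sample.
class ToyBertDataset:
    def __init__(self, arrays):
        self.arrays = arrays  # dict of field name -> list of per-sequence values

    def __len__(self):
        return len(self.arrays["input_ids"])

    def __getitem__(self, index):
        # One Sample-shaped tuple: per-sequence fields at this index.
        return tuple(
            self.arrays[key][index]
            for key in (
                "input_ids", "attention_mask", "token_type_ids",
                "masked_labels", "next_sentence_label",
            )
        )

data = {
    "input_ids": [[101, 103, 102]],
    "attention_mask": [[1, 1, 1]],
    "token_type_ids": [[0, 0, 0]],
    "masked_labels": [[-100, 2023, -100]],
    "next_sentence_label": [1],
}
dataset = ToyBertDataset(data)
sample = dataset[0]  # indexable like a PyTorch Dataset
```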
get_masked_labels
¶
get_masked_labels(
seq_len: int,
masked_lm_positions: list[int],
masked_lm_ids: list[int],
) -> ndarray[Any, Any]
Create masked labels array.
Parameters:
- seq_len (int) – Sequence length.
- masked_lm_positions (list[int]) – Indices in the sequence of masked tokens.
- masked_lm_ids (list[int]) – True token ID for each position in masked_lm_positions.
Returns:
- ndarray[Any, Any] – Array of length seq_len containing the true token ID for each masked token in input_ids and -100 for all tokens in input_ids which are not masked.
Source code in llm/datasets/bert.py
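The documented contract is simple enough to sketch directly. This hypothetical version returns a plain list rather than the NumPy array the real function produces, to stay dependency-free:

```python
# Sketch of the documented get_masked_labels contract (the real function
# returns an ndarray; a plain list stands in here).
def get_masked_labels_sketch(seq_len, masked_lm_positions, masked_lm_ids):
    # Start with -100 everywhere: unmasked positions are ignored by the MLM loss.
    labels = [-100] * seq_len
    for position, token_id in zip(masked_lm_positions, masked_lm_ids):
        labels[position] = token_id  # restore the true token ID at masked spots
    return labels

# Positions 1 and 3 were masked; their true token IDs are 2023 and 4248.
labels = get_masked_labels_sketch(5, [1, 3], [2023, 4248])
# labels == [-100, 2023, -100, 4248, -100]
```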
get_dataloader_from_nvidia_bert_shard
¶
get_dataloader_from_nvidia_bert_shard(
input_file: str,
batch_size: int,
*,
num_replicas: int | None = None,
rank: int | None = None,
seed: int = 0,
num_workers: int = 4
) -> DataLoader[Batch]
Create a dataloader from a dataset shard.
Parameters:
- input_file (str) – HDF5 file to load.
- batch_size (int) – Size of batches yielded by the dataloader.
- num_replicas (int | None, default: None) – Number of processes participating in distributed training.
- rank (int | None, default: None) – Rank of the current process within num_replicas.
- seed (int, default: 0) – Random seed used to shuffle the sampler.
- num_workers (int, default: 4) – Number of subprocesses to use for data loading.
Returns:
- DataLoader[Batch] – Dataloader which can be iterated over to yield Batch.
Source code in llm/datasets/bert.py
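The num_replicas/rank/seed parameters suggest DistributedSampler-style index partitioning. A hypothetical sketch of that idea only (the real function wraps PyTorch's DataLoader; its internals are not shown here):

```python
import random

# Hypothetical sketch of rank-sliced, seeded index partitioning, the idea
# behind distributed sampling: every rank applies the same seeded shuffle,
# then takes a disjoint strided slice of the indices.
def shard_indices(dataset_len, num_replicas, rank, seed=0):
    indices = list(range(dataset_len))
    random.Random(seed).shuffle(indices)  # identical shuffle on every rank
    return indices[rank::num_replicas]    # disjoint slice per rank

# Two replicas split eight samples; together they cover each index exactly once.
part0 = shard_indices(8, num_replicas=2, rank=0, seed=0)
part1 = shard_indices(8, num_replicas=2, rank=1, seed=0)
```

Because the shuffle is seeded identically on every rank, the slices are disjoint and their union covers the whole dataset, so no sample is read twice within an epoch.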
sharded_nvidia_bert_dataset
¶
sharded_nvidia_bert_dataset(
input_dir: str,
batch_size: int,
*,
num_replicas: int | None = None,
rank: int | None = None,
seed: int = 0,
num_workers: int = 4
) -> Generator[Batch, None, None]
Simple generator which yields pretraining batches.
Parameters:
- input_dir (str) – Directory of HDF5 shards to load samples from.
- batch_size (int) – Size of batches yielded.
- num_replicas (int | None, default: None) – Number of processes participating in distributed training.
- rank (int | None, default: None) – Rank of the current process within num_replicas.
- seed (int, default: 0) – Random seed used to shuffle the sampler.
- num_workers (int, default: 4) – Number of subprocesses to use for data loading.
Yields:
- Batch – Batches of pretraining data.
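The shard-iteration pattern behind such a generator can be sketched as follows. This is a hypothetical outline: lists of lists stand in for the per-file dataloaders that would be built from each HDF5 shard in input_dir.

```python
# Hypothetical sketch of iterating shards in order and yielding every
# batch from each one, as a flat stream of batches.
def sharded_batches(shards):
    for shard in shards:     # one entry per HDF5 shard in the directory
        for batch in shard:  # batches from that shard's dataloader
            yield batch

# String stand-ins for Batch objects from two shards.
shards = [["batch-a", "batch-b"], ["batch-c"]]
seen = list(sharded_batches(shards))
# seen == ["batch-a", "batch-b", "batch-c"]
```

Structuring this as a generator means only one shard's dataloader needs to be live at a time, which keeps memory bounded when the directory holds many shards.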