# llm.datasets.bert

NVIDIA BERT dataset provider.
## Batch

Bases: `NamedTuple`

BERT pretraining batch.
### input_ids *instance-attribute*

Input sequence token IDs (`(batch_size, seq_len)`).
### attention_mask *instance-attribute*

Input sequence attention mask (`(batch_size, seq_len)`).

Indicates which tokens in `input_ids` should be attended to (i.e., are not padding tokens). Also known as the input mask.
### token_type_ids *instance-attribute*

Token IDs indicating the segment labels (`(batch_size, seq_len)`).

E.g., if `input_ids` is composed of two distinct segments, the first segment will have token type IDs set to 0 and the second to 1. Also known as the segment IDs.
### masked_labels *instance-attribute*

True token IDs of masked tokens in `input_ids` (`(batch_size, seq_len)`).

Indices corresponding to non-masked tokens in `input_ids` are typically set to `-100` to avoid contributing to the MLM loss.
### next_sentence_labels *instance-attribute*

Boolean tensor indicating the next sentence labels (`(batch_size,)`).

A true (`1`) value indicates the next sentence/segment logically follows the first. If there is only one segment in `input_ids`, this can be set to `None`.
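To make the field semantics concrete, here is a toy batch built with NumPy (the real fields are typically `torch` tensors, and all token IDs below are illustrative, not real vocabulary IDs):

```python
import numpy as np

# Toy batch: 2 sequences of length 6. IDs are illustrative only.
input_ids = np.array([[101, 7592, 103, 2088, 102, 0],
                      [101, 103, 2154, 102, 0, 0]])
# 1 where a token should be attended to, 0 for padding tokens.
attention_mask = np.array([[1, 1, 1, 1, 1, 0],
                           [1, 1, 1, 1, 0, 0]])
# A single segment per sequence, so all segment labels are 0.
token_type_ids = np.zeros_like(input_ids)
# -100 everywhere except the masked positions, which hold the true IDs.
masked_labels = np.full_like(input_ids, -100)
masked_labels[0, 2] = 2307
masked_labels[1, 1] = 4937
# One next-sentence label per sequence in the batch.
next_sentence_labels = np.array([1, 0])
```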
## Sample

Bases: `NamedTuple`

### input_ids *instance-attribute*

Input sequence token IDs (`(seq_len,)`).
### attention_mask *instance-attribute*

Input sequence attention mask (`(seq_len,)`).

Indicates which tokens in `input_ids` should be attended to (i.e., are not padding tokens). Also known as the input mask.
### token_type_ids *instance-attribute*

Token IDs indicating the segment labels (`(seq_len,)`).

E.g., if `input_ids` is composed of two distinct segments, the first segment will have token type IDs set to 0 and the second to 1. Also known as the segment IDs.
### masked_labels *instance-attribute*

True token IDs of masked tokens in `input_ids` (`(seq_len,)`).

Indices corresponding to non-masked tokens in `input_ids` are typically set to `-100` to avoid contributing to the MLM loss.
### next_sentence_label *instance-attribute*

Boolean scalar tensor indicating the next sentence label.

A true (`1`) value indicates the next sentence/segment logically follows the first. If there is only one segment in `input_ids`, this can be set to `None`.
## NvidiaBertDataset

```python
NvidiaBertDataset(input_file: str)
```

NVIDIA BERT dataset.

Like the PyTorch `Dataset`, this dataset is indexable, returning a `Sample`.

Parameters:

- **input_file** (`str`) – HDF5 file to load.

Source code in `llm/datasets/bert.py`.
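The dataset follows the standard map-style PyTorch protocol: `__len__` plus `__getitem__`. A minimal sketch of that protocol (a toy stand-in, not the real HDF5-backed implementation) looks like:

```python
class ToyMapStyleDataset:
    """Sketch of a map-style dataset: __len__ and __getitem__ are all that
    PyTorch's DataLoader needs. NvidiaBertDataset returns Sample objects
    from __getitem__; here we use plain strings as stand-ins."""

    def __init__(self, samples):
        self._samples = samples

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, idx):
        return self._samples[idx]

dataset = ToyMapStyleDataset(['s0', 's1', 's2'])
item = dataset[1]  # indexable, like NvidiaBertDataset
```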
## get_masked_labels

```python
get_masked_labels(
    seq_len: int,
    masked_lm_positions: list[int],
    masked_lm_ids: list[int],
) -> ndarray[Any, Any]
```

Create masked labels array.

Parameters:

- **seq_len** (`int`) – Sequence length.
- **masked_lm_positions** (`list[int]`) – Index in sequence of masked tokens.
- **masked_lm_ids** (`list[int]`) – True token value for each position in `masked_lm_positions`.

Returns:

- `ndarray[Any, Any]` – Array of length `seq_len` with the true value for each corresponding masked token in `input_ids` and `-100` for all tokens in `input_ids` which are not masked.
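A NumPy sketch matching the documented behavior (an assumption about the implementation, not the library's actual code) is:

```python
import numpy as np

def get_masked_labels_sketch(seq_len, masked_lm_positions, masked_lm_ids):
    # Every position defaults to -100 (ignored by the MLM loss)...
    labels = np.full(seq_len, -100, dtype=np.int64)
    # ...except the masked positions, which hold the true token IDs.
    labels[masked_lm_positions] = masked_lm_ids
    return labels

get_masked_labels_sketch(6, [1, 4], [2307, 4937])
# array([-100, 2307, -100, -100, 4937, -100])
```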
## get_dataloader_from_nvidia_bert_shard

```python
get_dataloader_from_nvidia_bert_shard(
    input_file: str,
    batch_size: int,
    *,
    num_replicas: int | None = None,
    rank: int | None = None,
    seed: int = 0,
    num_workers: int = 4,
) -> DataLoader[Batch]
```

Create a dataloader from a dataset shard.

Parameters:

- **input_file** (`str`) – HDF5 file to load.
- **batch_size** (`int`) – Size of batches yielded by the dataloader.
- **num_replicas** (`int | None`, default: `None`) – Number of processes participating in distributed training.
- **rank** (`int | None`, default: `None`) – Rank of the current process within `num_replicas`.
- **seed** (`int`, default: `0`) – Random seed used to shuffle the sampler.
- **num_workers** (`int`, default: `4`) – Number of subprocesses to use for data loading.

Returns:

- `DataLoader[Batch]` – Dataloader which can be iterated over to yield `Batch`.
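The `num_replicas`, `rank`, and `seed` parameters exist to partition samples across processes, in the style of PyTorch's `DistributedSampler`. A simplified sketch of that index-partitioning logic (illustrative only, not the library's actual code):

```python
import random

def shard_indices(dataset_len, num_replicas, rank, seed):
    # Shuffle deterministically from the seed so every rank agrees on the
    # same permutation, then take every num_replicas-th index from rank.
    indices = list(range(dataset_len))
    random.Random(seed).shuffle(indices)
    return indices[rank::num_replicas]

# Every rank sees a disjoint slice; together the ranks cover the dataset.
shards = [shard_indices(10, num_replicas=2, rank=r, seed=0) for r in (0, 1)]
```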
## sharded_nvidia_bert_dataset

```python
sharded_nvidia_bert_dataset(
    input_dir: str,
    batch_size: int,
    *,
    num_replicas: int | None = None,
    rank: int | None = None,
    seed: int = 0,
    num_workers: int = 4,
) -> Generator[Batch, None, None]
```

Simple generator which yields pretraining batches.

Parameters:

- **input_dir** (`str`) – Directory of HDF5 shards to load samples from.
- **batch_size** (`int`) – Size of batches yielded.
- **num_replicas** (`int | None`, default: `None`) – Number of processes participating in distributed training.
- **rank** (`int | None`, default: `None`) – Rank of the current process within `num_replicas`.
- **seed** (`int`, default: `0`) – Random seed used to shuffle the sampler.
- **num_workers** (`int`, default: `4`) – Number of subprocesses to use for data loading.

Yields:

- `Batch` – Batches of pretraining data.
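The generator's control flow amounts to building a dataloader per shard and chaining them. A pure-Python sketch of that pattern, with a hypothetical `load_shard` callable standing in for `get_dataloader_from_nvidia_bert_shard`:

```python
def sharded_batches(shard_files, load_shard):
    # Exhaust one shard's batches before moving on to the next shard.
    for shard in shard_files:
        yield from load_shard(shard)

# Toy stand-in: each "shard" yields two fake batches.
batches = list(sharded_batches(['a', 'b'], lambda s: [s + '0', s + '1']))
# ['a0', 'a1', 'b0', 'b1']
```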