llm.datasets.bert
NVIDIA BERT dataset provider.
    
Batch
    Bases: NamedTuple

BERT pretraining batch.

input_ids (instance-attribute)
    Input sequence token IDs ((batch_size, seq_len)).
attention_mask (instance-attribute)
    Input sequence attention mask ((batch_size, seq_len)).
Indicates which tokens in input_ids should be attended to (i.e., are not
padding tokens). Also known as the input mask.
token_type_ids (instance-attribute)
    Token IDs indicating the segment labels ((batch_size, seq_len)).
E.g. if input_ids is composed of two distinct segments, the first segment
will have token IDs set to 0 and the second to 1. Also known as the
segment ids.
masked_labels (instance-attribute)
    True token IDs of masked tokens in input_ids ((batch_size, seq_len)).
Indices corresponding to non-masked tokens in input_ids are typically
set to -100 to avoid contributing to the MLM loss.
next_sentence_labels (instance-attribute)
    Boolean tensor indicating the next sentence label ((batch_size,)).
A true (1) value indicates the next sentence/segment logically follows
the first. If there is only one segment in input_ids, this can be
set to None.
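The -100 convention on masked_labels matches PyTorch's default ignore_index for cross-entropy, so the MLM loss can be computed directly over the whole sequence. A minimal sketch, assuming the field names documented above and using stand-in tensors in place of real model output:

    import torch
    import torch.nn.functional as F

    batch_size, seq_len, vocab_size = 2, 8, 30522
    logits = torch.randn(batch_size, seq_len, vocab_size)    # stand-in model output
    masked_labels = torch.full((batch_size, seq_len), -100)  # -100 everywhere...
    masked_labels[0, 3] = 1037                                # ...except masked positions
    mlm_loss = F.cross_entropy(
        logits.view(-1, vocab_size),
        masked_labels.view(-1),
        ignore_index=-100,  # non-masked tokens do not contribute to the MLM loss
    )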
    
Sample
    Bases: NamedTuple

BERT pretraining sample.

input_ids (instance-attribute)
    Input sequence token IDs ((seq_len,)).

attention_mask (instance-attribute)
    Input sequence attention mask ((seq_len,)).
Indicates which tokens in input_ids should be attended to (i.e., are not
padding tokens). Also known as the input mask.
token_type_ids (instance-attribute)
    Token IDs indicating the segment labels ((seq_len,)).
E.g. if input_ids is composed of two distinct segments, the first segment
will have token IDs set to 0 and the second to 1. Also known as the
segment ids.
masked_labels (instance-attribute)
    True token IDs of masked tokens in input_ids ((seq_len,)).
Indices corresponding to non-masked tokens in input_ids are typically
set to -100 to avoid contributing to the MLM loss.
next_sentence_label (instance-attribute)
    Boolean scalar tensor indicating the next sentence label.
A true (1) value indicates the next sentence/segment logically follows
the first. If there is only one segment in input_ids, this can be
set to None.
NvidiaBertDataset(input_file: str)

NVIDIA BERT dataset.

Like the PyTorch Dataset, this dataset is indexable and returns a Sample.
Example
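A minimal usage sketch (the shard path is hypothetical; Sample field names are as documented above):

    from llm.datasets.bert import NvidiaBertDataset

    dataset = NvidiaBertDataset('/data/shards/shard_000.hdf5')
    sample = dataset[0]      # indexing returns a Sample NamedTuple
    print(sample.input_ids)  # token IDs with shape (seq_len,)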
Parameters:
- input_file (str) – HDF5 file to load.
get_masked_labels(
    seq_len: int,
    masked_lm_positions: list[int],
    masked_lm_ids: list[int],
) -> ndarray[Any, Any]
Create masked labels array.
Parameters:
- seq_len (int) – Sequence length.
- masked_lm_positions (list[int]) – Indices in the sequence of masked tokens.
- masked_lm_ids (list[int]) – True token value for each position in masked_lm_positions.
Returns:
- ndarray[Any, Any] – Array of length seq_len with the true value for each corresponding masked token in input_ids and -100 for all tokens in input_ids which are not masked.
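For instance, a short sketch of the documented behavior (token values are illustrative):

    from llm.datasets.bert import get_masked_labels

    # Sequence of length 8 where positions 2 and 5 were masked and their
    # original tokens were 1037 and 2003.
    labels = get_masked_labels(
        seq_len=8,
        masked_lm_positions=[2, 5],
        masked_lm_ids=[1037, 2003],
    )
    # labels == [-100, -100, 1037, -100, -100, 2003, -100, -100]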
get_dataloader_from_nvidia_bert_shard(
    input_file: str,
    batch_size: int,
    *,
    num_replicas: int | None = None,
    rank: int | None = None,
    seed: int = 0,
    num_workers: int = 4
) -> DataLoader[Batch]
Create a dataloader from a dataset shard.
Parameters:
- input_file (str) – HDF5 file to load.
- batch_size (int) – Size of batches yielded by the dataloader.
- num_replicas (int | None, default: None) – Number of processes participating in distributed training.
- rank (int | None, default: None) – Rank of the current process within num_replicas.
- seed (int, default: 0) – Random seed used to shuffle the sampler.
- num_workers (int, default: 4) – Number of subprocesses to use for data loading.
Returns:
- DataLoader[Batch] – Dataloader which can be iterated over to yield Batch objects.
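A minimal usage sketch (the shard path is hypothetical; distributed arguments are left at their defaults):

    from llm.datasets.bert import get_dataloader_from_nvidia_bert_shard

    loader = get_dataloader_from_nvidia_bert_shard(
        '/data/shards/shard_000.hdf5',  # hypothetical shard path
        batch_size=32,
    )
    for batch in loader:
        ...  # batch is a Batch NamedTuple of tensors with leading dim batch_size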
sharded_nvidia_bert_dataset(
    input_dir: str,
    batch_size: int,
    *,
    num_replicas: int | None = None,
    rank: int | None = None,
    seed: int = 0,
    num_workers: int = 4
) -> Generator[Batch, None, None]
Simple generator which yields pretraining batches.
Parameters:
- input_dir (str) – Directory of HDF5 shards to load samples from.
- batch_size (int) – Size of batches yielded.
- num_replicas (int | None, default: None) – Number of processes participating in distributed training.
- rank (int | None, default: None) – Rank of the current process within num_replicas.
- seed (int, default: 0) – Random seed used to shuffle the sampler.
- num_workers (int, default: 4) – Number of subprocesses to use for data loading.
Yields:
- Batch – Batches of pretraining data.
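A minimal usage sketch (the shard directory is hypothetical):

    from llm.datasets.bert import sharded_nvidia_bert_dataset

    for batch in sharded_nvidia_bert_dataset(
        '/data/shards/',  # hypothetical directory of HDF5 shards
        batch_size=32,
    ):
        ...  # yields Batch objects drawn from every shard in the directory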