llm.datasets.roberta
Custom RoBERTa dataset provider.
This is designed to work with data produced by the RoBERTa encoder preprocessing script in llm.preprocess.roberta.
RoBERTaDataset ¶
RoBERTaDataset(
input_file: Path | str,
mask_token_id: int,
mask_token_prob: float,
vocab_size: int,
)
RoBERTa pretraining dataset.
Like the PyTorch Dataset, this dataset is indexable, returning a Sample.
Samples are randomly masked at runtime using the provided parameters. Next sentence prediction is not supported.
Example
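A minimal usage sketch (not taken from the library's documentation); the file path is illustrative, and the mask token ID and vocabulary size shown match the roberta-base tokenizer:

from torch.utils.data import DataLoader

from llm.datasets.roberta import RoBERTaDataset

dataset = RoBERTaDataset(
    input_file='data/roberta-pretrain-000.hdf5',  # illustrative path to a preprocessed HDF5 shard
    mask_token_id=50264,   # <mask> ID in the roberta-base tokenizer
    mask_token_prob=0.15,  # commonly 0.15 for BERT/RoBERTa pretraining
    vocab_size=50265,
)

sample = dataset[0]  # masking is applied randomly on each access
loader = DataLoader(dataset, batch_size=8, shuffle=True)

Because masking happens at runtime, repeated passes over the same data see different masks.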
Parameters:
- input_file (Path | str) – HDF5 file to load.
- mask_token_id (int) – ID of the mask token in the vocabulary.
- mask_token_prob (float) – Probability of a given token in the sample being masked.
- vocab_size (int) – Size of the vocabulary. Used to replace masked tokens with a random token 10% of the time.
Source code in llm/datasets/roberta.py
bert_mask_sequence ¶
bert_mask_sequence(
token_ids: LongTensor,
special_tokens_mask: BoolTensor,
mask_token_id: int,
mask_token_prob: float,
vocab_size: int,
) -> tuple[LongTensor, LongTensor]
Randomly mask a BERT training sequence.
Source: transformers/data/data_collator.py
Parameters:
- token_ids (LongTensor) – Input sequence token IDs to mask.
- special_tokens_mask (BoolTensor) – Mask of special tokens in the sequence which should never be masked.
- mask_token_id (int) – ID of the mask token in the vocabulary.
- mask_token_prob (float) – Probability of a given token in the sample being masked.
- vocab_size (int) – Size of the vocabulary. Used to replace masked tokens with a random token 10% of the time.
Returns:
- tuple[LongTensor, LongTensor] – Masked token_ids and the masked labels.
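A standalone call sketch, assuming RoBERTa-style inputs; the <s>, </s>, and <mask> token IDs shown are illustrative and match the roberta-base tokenizer:

import torch

from llm.datasets.roberta import bert_mask_sequence

# Token IDs for a short sequence; 0 and 2 are the <s>/</s> IDs in roberta-base.
token_ids = torch.tensor([0, 713, 16, 10, 1296, 2], dtype=torch.long)
# Special tokens are excluded from masking.
special_tokens_mask = torch.tensor([1, 0, 0, 0, 0, 1], dtype=torch.bool)

masked_ids, labels = bert_mask_sequence(
    token_ids,
    special_tokens_mask,
    mask_token_id=50264,
    mask_token_prob=0.15,
    vocab_size=50265,
)

In the transformers data collator this function is adapted from, labels at positions that were not selected for masking are set to -100 so the loss ignores them; the returned labels here presumably follow the same convention.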