llm.datasets.roberta
Custom RoBERTa dataset provider.
This is designed to work with data produced by the RoBERTa encoder
preprocessing script in
llm.preprocess.roberta.
RoBERTaDataset ¶
RoBERTaDataset(
    input_file: Path | str,
    mask_token_id: int,
    mask_token_prob: float,
    vocab_size: int,
)
RoBERTa pretraining dataset.
Like the PyTorch Dataset, this dataset is indexable, returning a Sample.
Samples are randomly masked at runtime using the provided parameters. Next sentence prediction is not supported.
Example
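A minimal usage sketch. The HDF5 path, mask token ID, and vocabulary size below are hypothetical placeholders; the actual values depend on your tokenizer and the output of llm.preprocess.roberta, and the exact fields of the returned Sample are defined in llm.datasets.roberta.

from torch.utils.data import DataLoader

from llm.datasets.roberta import RoBERTaDataset

# Hypothetical values for illustration only.
dataset = RoBERTaDataset(
    input_file='data/roberta/shard-000.hdf5',
    mask_token_id=50264,
    mask_token_prob=0.15,
    vocab_size=50265,
)

sample = dataset[0]  # indexable like a torch Dataset; returns a Sample
loader = DataLoader(dataset, batch_size=8, shuffle=True)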
Parameters:
- input_file (Path | str) – HDF5 file to load.
- mask_token_id (int) – ID of the mask token in the vocabulary.
- mask_token_prob (float) – Probability of a given token in the sample being masked.
- vocab_size (int) – Size of the vocabulary. Used to replace masked tokens with a random token 10% of the time.
bert_mask_sequence ¶
bert_mask_sequence(
    token_ids: LongTensor,
    special_tokens_mask: BoolTensor,
    mask_token_id: int,
    mask_token_prob: float,
    vocab_size: int,
) -> tuple[LongTensor, LongTensor]
Randomly mask a BERT training sequence.
Source: transformers/data/data_collator.py
Parameters:
- token_ids (LongTensor) – Input sequence token IDs to mask.
- special_tokens_mask (BoolTensor) – Mask of special tokens in the sequence which should never be masked.
- mask_token_id (int) – ID of the mask token in the vocabulary.
- mask_token_prob (float) – Probability of a given token in the sample being masked.
- vocab_size (int) – Size of the vocabulary. Used to replace masked tokens with a random token 10% of the time.
Returns:
- tuple[LongTensor, LongTensor] – Masked token_ids and the masked labels.
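A minimal sketch of calling the function on a toy sequence. The token IDs, mask token ID, and vocabulary size are made up for illustration; positions marked True in special_tokens_mask (standing in for tokens such as <s> and </s>) are never masked.

import torch

from llm.datasets.roberta import bert_mask_sequence

# Toy sequence: IDs 0 and 2 stand in for <s> and </s> special tokens.
token_ids = torch.tensor([0, 11, 23, 42, 7, 2], dtype=torch.long)
special_tokens_mask = torch.tensor(
    [True, False, False, False, False, True], dtype=torch.bool,
)

masked_ids, labels = bert_mask_sequence(
    token_ids,
    special_tokens_mask,
    mask_token_id=50264,   # hypothetical <mask> ID
    mask_token_prob=0.15,
    vocab_size=50265,
)
# masked_ids has roughly mask_token_prob of the non-special positions
# replaced (mostly by the mask token, sometimes by a random token);
# labels holds the original IDs at the masked positions so a model can
# compute the masked language modeling loss.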