# GPT Pretraining
This guide walks you through pretraining a GPT-like causal language model. Instructions are specific to ALCF's Polaris machine; however, the general steps should apply to any system.
The training script is based on HuggingFace's CLM example.
## Installation
- Clone the repository.
- Load the Python and CUDA modules on Polaris. These modules will need to be loaded each time you activate the virtual environment.
- Create a virtual environment.
- Install PyTorch and the `llm-pytorch` package (see the command sketch after this list). I have personally tested PyTorch 1.13.1 with CUDA 11.7 on Polaris, but other versions of PyTorch should work fine.
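For example, the full setup on Polaris might look like the following sketch. The repository URL, module names, and PyTorch wheel index are assumptions; check the project README for the exact instructions.

```bash
# Sketch of the installation steps on Polaris (URLs and module names are assumptions).
git clone https://github.com/gpauloski/llm-pytorch   # assumed repository URL
cd llm-pytorch

# Assumed module names providing Python and CUDA on Polaris.
module load conda
module load cudatoolkit-standalone

python -m venv venv
source venv/bin/activate

# PyTorch 1.13.1 + CUDA 11.7 is the tested combination mentioned above.
pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -e .    # install the llm-pytorch package from the clone
```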
## Running the Scripts
The basic training command is `python -m llm.trainers.gpt {options}`.
The script uses HuggingFace Accelerate to detect the distributed environment. I suggest using `torchrun` for distributed training.
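For example, a single-node launch on one Polaris node (4 GPUs) might look like the sketch below; `{options}` stands for the trainer arguments described in the rest of this guide.

```bash
# Launch 4 processes (one per GPU) on a single node; torchrun sets up the
# distributed environment that Accelerate then detects.
torchrun --standalone --nproc_per_node=4 -m llm.trainers.gpt {options}
```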
Below is an example job script for Polaris (using PBS) that automatically sets up the distributed environment according to your job parameters. Note that this example trains a small 125M-parameter GPT model on WikiText. Lines marked with a `TODO` comment contain information that you must complete yourself.
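The following is only a minimal sketch of what such a `pretrain.pbs` might contain; the module names, the virtual environment path, and the `--output_dir` option are assumptions, so compare against the actual script shipped with the repository.

```bash
#!/bin/bash
# TODO: set your project allocation on the next line.
#PBS -A {ALLOC}
#PBS -q debug
#PBS -l select=1:system=polaris
#PBS -l walltime=1:00:00
#PBS -l filesystems=home:grand

cd "${PBS_O_WORKDIR}"

# TODO: load modules and activate your virtual environment (names are assumptions).
module load conda
source /path/to/venv/bin/activate

# Derive the distributed layout from the PBS job parameters.
NNODES=$(wc -l < "${PBS_NODEFILE}")
NGPUS_PER_NODE=4                    # Polaris nodes have 4 A100 GPUs
MASTER_ADDR=$(head -n 1 "${PBS_NODEFILE}")

# Launch one torchrun per node via mpiexec; torchrun spawns one process per GPU.
mpiexec -n "${NNODES}" --ppn 1 \
    torchrun \
        --nnodes="${NNODES}" \
        --nproc_per_node="${NGPUS_PER_NODE}" \
        --rdzv_backend=c10d \
        --rdzv_endpoint="${MASTER_ADDR}:29500" \
        -m llm.trainers.gpt \
            --model_name_or_path EleutherAI/gpt-neo-125m \
            --dataset_name wikitext \
            --dataset_config_name wikitext-2-raw-v1 \
            --output_dir runs/gpt-neo-125m-wikitext  # assumed option name
```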
After updating `pretrain.pbs` accordingly, you can either execute the script directly in an interactive session or submit it as a batch job.
Interactive:
```bash
$ qsub -A {ALLOC} -l select=1:system=polaris -l walltime=1:00:00 -I -q debug -l filesystems=home:grand
$ chmod u+x pretrain.pbs
$ ./pretrain.pbs
```
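Batch submission is just `qsub` on the same script, assuming the `#PBS` directives inside `pretrain.pbs` already specify your allocation and resources (otherwise pass them on the command line as in the interactive example):

```bash
$ qsub pretrain.pbs
```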
## Customize Pretraining
### Model
This script uses HuggingFace models. The `--model_name_or_path` argument takes either a path to a saved HuggingFace model directory or the name of a model on the HuggingFace Hub. Here are some useful options:
- `--model_name_or_path EleutherAI/gpt-neo-125m`: Small GPT model useful for debugging.
- `--model_name_or_path EleutherAI/gpt-neo-1.3B`: Works with K-FAC and is almost the same size as GPT-2.
- `--model_name_or_path EleutherAI/gpt-neox-20b`: GPT-NeoX 20B.
- `--model_name_or_path gpt2`: HuggingFace's GPT-2 implementation, which uses Conv1D layers instead of Linear layers and therefore does not work with K-FAC.
Note that the `--low_cpu_mem_usage` option will instantiate the model architecture for pretraining without downloading the actual weights.
Alternatively, a `--config_name` and `--tokenizer_name` can be provided, where each can be either the name of an existing model/tokenizer or a path to their respective HuggingFace-compatible configurations.
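For instance, the two ways of specifying the model might look like the following sketch; `{options}` stands for the remaining trainer arguments.

```bash
# Reference a model on the HuggingFace Hub (or a local model directory) and
# instantiate the architecture without downloading weights (per the note above).
python -m llm.trainers.gpt --model_name_or_path EleutherAI/gpt-neo-125m --low_cpu_mem_usage {options}

# Or build the model from a config and tokenizer name/path instead.
python -m llm.trainers.gpt --config_name gpt2 --tokenizer_name gpt2 {options}
```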
### Dataset
There are two ways to provide a pretraining dataset: via HuggingFace Datasets or CSV/JSON/text files.
To use an existing dataset via the Dataset Hub, find the name of the dataset and the name of the subset.
```bash
# Generic format
--dataset_name {name} --dataset_config_name {subset}

# WikiText (good for testing)
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1

# The Pile
--dataset_name EleutherAI/pile --dataset_config_name all

# The Pile-10K (subset for testing)
--dataset_name NeelNanda/pile-10k
```
Datasets are downloaded to `~/.cache/huggingface/datasets` by default. This can be changed by setting the `HF_DATASETS_CACHE` environment variable.
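For example, to keep the cache on a project filesystem instead of your home directory (the path below is illustrative):

```bash
# Redirect the HuggingFace Datasets cache; set this before launching training.
export HF_DATASETS_CACHE=/grand/{PROJECT}/hf-datasets-cache
```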
### Checkpointing
Checkpointing is not enabled by default. Use `--checkpointing_steps {STEPS}` to enable checkpointing. To resume training from a checkpoint, add `--resume_from_checkpoint`.
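For example (the step count is arbitrary, and `{options}` stands for the rest of your training arguments):

```bash
# Write a checkpoint every 1000 steps and resume from a previous checkpoint
# when restarting the job.
python -m llm.trainers.gpt {options} --checkpointing_steps 1000 --resume_from_checkpoint
```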
## Limitations
- FP16 training with HuggingFace Accelerate is faster than FP32 but still uses the same amount of memory.