BERT Pretraining
This guide walks through BERT pretraining based on NVIDIA's configuration, which uses large-batch training with the LAMB optimizer to reach a 64K batch size in phase 1 and a 32K batch size in phase 2.
Setup
This guide assumes you have installed the llm package and its dependencies as described in the Installation Guide.
You also need the BERT pretraining dataset from NVIDIA's examples.
We will use the configuration provided in configs/bert-large-lamb.py.
We start by copying the example configuration into our training directory. Note that all of the commands below are run from the root of the llm-pytorch repository.
mkdir -p runs/bert-large-pretraining/
cp configs/bert-large-lamb.py runs/bert-large-pretraining/config.py
Review runs/bert-large-pretraining/config.py to see if you want to adjust any options, though the default paths will work.
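The config is a plain Python file of module-level settings (PHASE, discussed later in this guide, is one of them). If you want a quick look at the top-level options before launching, one way (the exact option names depend on the config file itself) is:
grep -E '^[A-Z_]+[[:space:]]*=' runs/bert-large-pretraining/config.py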
Running the Trainer
There are a number of ways you may launch the trainer (example commands for each are sketched below):
- Single-GPU for debugging
- Multi-GPU, single-node
- Multi-node, multi-GPU
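The exact invocations depend on your environment; as a sketch, each mode maps onto a torchrun launch of the llm.trainers.bert module, mirroring the launcher assembled in the PBS script below (NNODES and PRIMARY_RANK are placeholders you set yourself):
CONFIG=runs/bert-large-pretraining/config.py
# Single-GPU for debugging: one local process with a standalone rendezvous
torchrun --standalone --nproc_per_node=1 -m llm.trainers.bert --config $CONFIG
# Multi-GPU, single-node: one process per local GPU
torchrun --standalone --nproc_per_node=auto -m llm.trainers.bert --config $CONFIG
# Multi-node, multi-GPU: c10d rendezvous hosted on the primary node
torchrun --nnodes=$NNODES --nproc_per_node=auto --rdzv_backend=c10d --rdzv_endpoint=$PRIMARY_RANK -m llm.trainers.bert --config $CONFIG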
Typically, you will want to use the last option inside the script you submit to your batch scheduler. Below is an example submission script for a PBS scheduler.
#!/bin/bash
#PBS -A __ALLOCATION__
#PBS -q __QUEUE__
#PBS -M __EMAIL__
#PBS -m abe
#PBS -l select=16:system=polaris
#PBS -l walltime=6:00:00
#PBS -l filesystems=home:grand
#PBS -j oe
# Move to the directory the job was submitted from (the repository root)
cd $PBS_O_WORKDIR

# Figure out the training environment based on whether PBS_NODEFILE exists
if [[ -z "${PBS_NODEFILE}" ]]; then
    RANKS=$HOSTNAME
    NNODES=1
else
    PRIMARY_RANK=$(head -n 1 $PBS_NODEFILE)
    RANKS=$(tr '\n' ' ' < $PBS_NODEFILE)
    NNODES=$(< $PBS_NODEFILE wc -l)
    cat $PBS_NODEFILE
fi
CONFIG="runs/bert-large-pretraining/config.py"
# Commands to run prior to the Python script for setting up the environment
module load cray-python
module load cudatoolkit-standalone/11.7.1
source /path/to/virtualenv/bin/activate
# torchrun launch configuration
LAUNCHER="torchrun "
LAUNCHER+="--nnodes=$NNODES --nproc_per_node=auto --max_restarts 0 "
if [[ "$NNODES" -eq 1 ]]; then
    LAUNCHER+="--standalone "
else
    LAUNCHER+="--rdzv_backend=c10d --rdzv_endpoint=$PRIMARY_RANK"
fi
# Training script and parameters
CMD="$LAUNCHER -m llm.trainers.bert --config $CONFIG"
echo "Training Command: $CMD"
mpiexec --hostfile $PBS_NODEFILE -np $NNODES --env OMP_NUM_THREADS=8 --cpu-bind none $CMD
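Assuming the script above is saved as runs/bert-large-pretraining/submit.sh (the filename is arbitrary), you would submit it with qsub from the repository root:
qsub runs/bert-large-pretraining/submit.sh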
Training Phases 1 and 2
Train the model for phase 1. At the end of phase 1, you will see a checkpoint named runs/bert-large-pretraining/checkpoints/phase-1/global_step_7039.pt.
To transition to phase 2:
- Set PHASE = 2 in the config file.
- Create a new directory for phase 2 checkpoints at runs/bert-large-pretraining/checkpoints/phase-2.
- Copy the last phase 1 checkpoint into the phase 2 directory under the name global_step_0.pt, as sketched below.
- Continue training to complete phase 2.
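Assuming the default checkpoint layout shown above, after setting PHASE = 2 in runs/bert-large-pretraining/config.py the remaining steps amount to:
# Create the phase 2 checkpoint directory
mkdir -p runs/bert-large-pretraining/checkpoints/phase-2
# Seed phase 2 with the final phase 1 checkpoint, renamed to global_step_0.pt
cp runs/bert-large-pretraining/checkpoints/phase-1/global_step_7039.pt runs/bert-large-pretraining/checkpoints/phase-2/global_step_0.pt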
Converting the Pretrained Model
TODO
SQuAD Evaluation
TODO