llm.checkpoint
Checkpoint
Bases: NamedTuple
Data loaded from a checkpoint.
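The concrete fields of this NamedTuple are defined in llm/checkpoint.py and are not listed here; the sketch below only illustrates the NamedTuple pattern, and its field names are hypothetical.

```python
# Illustrative sketch only: the field names below are assumptions, not the
# library's actual Checkpoint attributes.
from typing import Any, NamedTuple


class CheckpointSketch(NamedTuple):
    """Data loaded from a checkpoint (hypothetical fields)."""

    global_step: int       # training step the checkpoint was saved at
    state: dict[str, Any]  # e.g. model/optimizer/scheduler state_dicts
```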
load_checkpoint
load_checkpoint(
    checkpoint_dir: str | Path,
    global_step: int | None = None,
    map_location: Any = None,
) -> Checkpoint | None
Load checkpoint from directory.
Parameters:

- checkpoint_dir (str | Path) – Directory containing checkpoint files.
- global_step (int | None, default: None) – Global step checkpoint to load. If None, loads the latest checkpoint.
- map_location (Any, default: None) – Optional map_location to pass to torch.load().
Returns:

- Checkpoint | None – Checkpoint, or None if no checkpoint was found.
Raises:

- OSError – If checkpoint_dir does not exist.
- OSError – If global_step is specified but the file does not exist.
Source code in llm/checkpoint.py
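A minimal usage sketch follows; the directory name and the choice to map tensors to CPU are placeholders, not part of the API.

```python
# Usage sketch for load_checkpoint; "checkpoints/" is a placeholder directory.
import torch

from llm.checkpoint import load_checkpoint

checkpoint = load_checkpoint(
    "checkpoints/",                     # directory written by save_checkpoint
    global_step=None,                   # None -> load the latest checkpoint
    map_location=torch.device("cpu"),   # forwarded to torch.load()
)

if checkpoint is None:
    print("No checkpoint found; starting from scratch.")
```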
save_checkpoint
save_checkpoint(
    checkpoint_dir: str | Path,
    global_step: int,
    model: Module,
    optimizer: Optimizer | None = None,
    scheduler: _LRScheduler | None = None,
    **kwargs: Any
) -> None
Save checkpoint to directory.
Saves the checkpoint as {checkpoint_dir}/global_step_{global_step}.pt.
Parameters:

- checkpoint_dir (str | Path) – Directory to save the checkpoint to.
- global_step (int) – Training step used as the key for checkpoints.
- model (Module) – Model to save the state_dict of.
- optimizer (Optimizer | None, default: None) – Optional optimizer to save the state_dict of.
- scheduler (_LRScheduler | None, default: None) – Optional scheduler to save the state_dict of.
- kwargs (Any, default: {}) – Additional key-value pairs to add to the checkpoint.
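A minimal usage sketch follows; the toy model, optimizer, scheduler, and the extra epoch entry are illustrative, not requirements of the API.

```python
# Usage sketch for save_checkpoint; objects below are placeholders.
import torch

from llm.checkpoint import save_checkpoint

model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

save_checkpoint(
    "checkpoints/",       # directory; files are keyed by global_step
    global_step=100,      # training step used in the checkpoint filename
    model=model,
    optimizer=optimizer,  # optional; omit to save only the model state
    scheduler=scheduler,  # optional
    epoch=1,              # extra key-value pairs are stored via **kwargs
)
```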