llm.checkpoint

Checkpoint

Bases: NamedTuple

Data loaded from a checkpoint.
load_checkpoint

load_checkpoint(
    checkpoint_dir: str | Path,
    global_step: int | None = None,
    map_location: Any = None,
) -> Checkpoint | None

Load checkpoint from directory.
Parameters:

- checkpoint_dir (str | Path) – Directory containing checkpoint files.
- global_step (int | None, default: None) – Global step checkpoint to load. If None, loads the latest checkpoint.
- map_location (Any, default: None) – Optional map_location to pass to torch.load().
Returns:

- Checkpoint | None – Checkpoint or None if no checkpoint was found.
Raises:

- OSError – If checkpoint_dir does not exist.
- OSError – If global_step is specified but the file does not exist.
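Example (a minimal usage sketch based only on the signature above; the directory path is a hypothetical placeholder):

```python
import torch

from llm.checkpoint import load_checkpoint

# "runs/example" is a placeholder directory containing saved checkpoints.
checkpoint = load_checkpoint(
    "runs/example",
    global_step=None,                  # None loads the latest checkpoint
    map_location=torch.device("cpu"),  # forwarded to torch.load()
)

if checkpoint is None:
    print("No checkpoint found; starting from scratch.")
else:
    print("Resumed from the latest checkpoint.")
```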
save_checkpoint

save_checkpoint(
    checkpoint_dir: str | Path,
    global_step: int,
    model: Module,
    optimizer: Optimizer | None = None,
    scheduler: _LRScheduler | None = None,
    **kwargs: Any
) -> None

Save checkpoint to directory.
Saves the checkpoint as {checkpoint_dir}/global_step_{global_step}.pt.
Parameters:

- checkpoint_dir (str | Path) – Directory to save checkpoint to.
- global_step (int) – Training step used as the key for checkpoints.
- model (Module) – Model to save state_dict of.
- optimizer (Optimizer | None, default: None) – Optional optimizer to save state_dict of.
- scheduler (_LRScheduler | None, default: None) – Optional scheduler to save state_dict of.
- kwargs (Any, default: {}) – Additional key-value pairs to add to the checkpoint.
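Example (a minimal usage sketch based only on the signature above; the model, optimizer, scheduler, and directory are illustrative placeholders, not part of the library):

```python
import torch

from llm.checkpoint import save_checkpoint

# Placeholder training objects for illustration.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

save_checkpoint(
    "runs/example",       # checkpoint_dir
    global_step=1000,     # training step used as the checkpoint key
    model=model,          # model state_dict is saved
    optimizer=optimizer,  # optional optimizer state_dict
    scheduler=scheduler,  # optional scheduler state_dict
    epoch=3,              # extra key-value pair stored via **kwargs
)
```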