CLI Reference

This page documents our command line tools.

Warning

The usage examples show the module that implements each CLI. To run a CLI, execute its module with the Python interpreter. For example:

$ python -m llm.preprocess.download --help

Note

This list is not exhaustive. In particular, the training scripts provided in the llm.trainers module are not listed here.

llm.preprocess.download

Pretraining text downloader.

Usage:

llm.preprocess.download [OPTIONS]

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| -d, --dataset | choice (wikipedia \| bookscorpus) | Dataset to download. | _required_ |
| -o, --output-dir | text | Output directory. | _required_ |
| --log-level | choice (DEBUG \| INFO \| WARNING \| ERROR) | Minimum logging level. | INFO |
| --rich / --no-rich | boolean | Use rich output formatting. | False |
| --help | boolean | Show this message and exit. | False |
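
For example, to download the Wikipedia corpus into a local directory (the output path below is illustrative):

$ python -m llm.preprocess.download --dataset wikipedia --output-dir data/download/wikipedia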

llm.preprocess.roberta

Encode FILEPATHS for RoBERTa pretraining.

Usage:

llm.preprocess.roberta [OPTIONS] FILEPATHS

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| -o, --output-dir | text | Output directory for encoded shards. | _required_ |
| -t, --tokenizer | text | Path to trained tokenizer to load. | _required_ |
| -l, --max-seq-len | integer | Maximum sequence length. | 512 |
| -s, --short-seq-prob | float | Probability of creating shorter sequences. | 0.1 |
| -p, --processes | integer | Number of processes for concurrent shard encoding. | 4 |
| --log-level | choice (DEBUG \| INFO \| WARNING \| ERROR) | Minimum logging level. | INFO |
| --rich / --no-rich | boolean | Use rich output formatting. | False |
| --help | boolean | Show this message and exit. | False |
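
For example, to encode a set of text shards with a previously trained tokenizer (the paths and glob below are illustrative, assuming a tokenizer serialized by llm.preprocess.tokenizer):

$ python -m llm.preprocess.roberta data/shards/*.txt --output-dir data/encoded --tokenizer data/tokenizer.json --processes 8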

llm.preprocess.shard

Shard documents in FILEPATHS into equally sized files.

Usage:

llm.preprocess.shard [OPTIONS] FILEPATHS

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| -o, --output-dir | text | Output directory for encoded shards. | _required_ |
| -s, --size | text | Max data size of each shard. | _required_ |
| -f, --format | text | Shard name format where {index} is replaced by the shard index. | shard-{index}.txt |
| --shuffle / --no-shuffle | boolean | Shuffle documents before sharding. | False |
| --log-level | choice (DEBUG \| INFO \| WARNING \| ERROR) | Minimum logging level. | INFO |
| --rich / --no-rich | boolean | Use rich output formatting. | False |
| --help | boolean | Show this message and exit. | False |
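
For example, to shuffle and shard a set of raw text files (the paths and the size string below are illustrative; run --help to confirm the accepted size format):

$ python -m llm.preprocess.shard data/download/*.txt --output-dir data/shards --size 250MB --shuffle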

llm.preprocess.tokenizer

Train a tokenizer on FILEPATHS.

Argument defaults match the standard uncased BERT wordpiece tokenizer.

Usage:

llm.preprocess.tokenizer [OPTIONS] FILEPATHS

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| -o, --output-file | text | Output file to save the serialized tokenizer to. | _required_ |
| -s, --size | integer | Size of vocabulary. | 30522 |
| -t, --tokenizer | choice (bpe \| wordpiece) | Tokenizer type. | wordpiece |
| --cased / --uncased | boolean | Vocab/tokenizer is case-sensitive. | False |
| -s, --special-token | text | Special tokens to prepend to vocab. | ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'] |
| --log-level | choice (DEBUG \| INFO \| WARNING \| ERROR) | Minimum logging level. | INFO |
| --rich / --no-rich | boolean | Use rich output formatting. | False |
| --help | boolean | Show this message and exit. | False |
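
For example, to train the default uncased wordpiece tokenizer on a set of text shards (the paths below are illustrative):

$ python -m llm.preprocess.tokenizer data/shards/*.txt --output-file data/tokenizer.json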