CLI Reference

This page documents our command line tools.

Warning

The usage examples show the module that implements each CLI. To run a CLI, execute its module with the Python interpreter. For example:

$ python -m llm.preprocess.download --help

Note

This list is not exhaustive. In particular, the training scripts provided in the llm.trainers module are not listed here.

llm.preprocess.download

Pretraining text downloader.

Usage:

llm.preprocess.download [OPTIONS]

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| -d, --dataset | choice (wikipedia \| bookscorpus) | Dataset to download. | _required_ |
| -o, --output-dir | text | Output directory. | _required_ |
| --log-level | choice (DEBUG \| INFO \| WARNING \| ERROR) | Minimum logging level. | INFO |
| --rich / --no-rich | boolean | Use rich output formatting. | False |
| --help | boolean | Show this message and exit. | False |
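
For example, to download the Wikipedia corpus into a local directory (the output path below is illustrative):

$ python -m llm.preprocess.download --dataset wikipedia --output-dir data/download/wikipedia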

llm.preprocess.roberta

Encode FILEPATHS for RoBERTa pretraining.

Usage:

llm.preprocess.roberta [OPTIONS] FILEPATHS

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| -o, --output-dir | text | Output directory for encoded shards. | _required_ |
| -t, --tokenizer | text | Path to trained tokenizer to load. | _required_ |
| -l, --max-seq-len | integer | Maximum sequence length. | 512 |
| -s, --short-seq-prob | float | Probability of creating shorter sequences. | 0.1 |
| -p, --processes | integer | Number of processes for concurrent shard encoding. | 4 |
| --log-level | choice (DEBUG \| INFO \| WARNING \| ERROR) | Minimum logging level. | INFO |
| --rich / --no-rich | boolean | Use rich output formatting. | False |
| --help | boolean | Show this message and exit. | False |
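
For example, to encode a set of text shards with a previously trained tokenizer (the paths and glob below are illustrative, assuming a tokenizer serialized by llm.preprocess.tokenizer):

$ python -m llm.preprocess.roberta data/shards/*.txt --output-dir data/encoded --tokenizer data/tokenizer.json --processes 8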

llm.preprocess.shard

Shard documents in FILEPATHS into equally sized files.

Usage:

llm.preprocess.shard [OPTIONS] FILEPATHS

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| -o, --output-dir | text | Output directory for encoded shards. | _required_ |
| -s, --size | text | Max data size of each shard. | _required_ |
| -f, --format | text | Shard name format where {index} is replaced by the shard index. | shard-{index}.txt |
| --shuffle / --no-shuffle | boolean | Shuffle documents before sharding. | False |
| --log-level | choice (DEBUG \| INFO \| WARNING \| ERROR) | Minimum logging level. | INFO |
| --rich / --no-rich | boolean | Use rich output formatting. | False |
| --help | boolean | Show this message and exit. | False |
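
For example, to shuffle and shard a set of raw text files (the paths and the size string below are illustrative; run --help to confirm the accepted size format):

$ python -m llm.preprocess.shard data/download/*.txt --output-dir data/shards --size 250MB --shuffle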

llm.preprocess.tokenizer

Train a tokenizer on FILEPATHS.

Argument defaults match the standard uncased BERT wordpiece tokenizer.

Usage:

llm.preprocess.tokenizer [OPTIONS] FILEPATHS

Options:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| -o, --output-file | text | Output file to save the serialized tokenizer to. | _required_ |
| -s, --size | integer | Size of vocabulary. | 30522 |
| -t, --tokenizer | choice (bpe \| wordpiece) | Tokenizer type. | wordpiece |
| --cased / --uncased | boolean | Vocab/tokenizer is case-sensitive. | False |
| -s, --special-token | text | Special tokens to prepend to vocab. | ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'] |
| --log-level | choice (DEBUG \| INFO \| WARNING \| ERROR) | Minimum logging level. | INFO |
| --rich / --no-rich | boolean | Use rich output formatting. | False |
| --help | boolean | Show this message and exit. | False |
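
For example, to train the default uncased wordpiece tokenizer on a set of text shards (the paths below are illustrative):

$ python -m llm.preprocess.tokenizer data/shards/*.txt --output-file data/tokenizer.json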