CLI Reference
This page provides documentation for our command line tools.
Warning
The usage examples show the executable module each CLI belongs to. To run a CLI, execute its module with the Python interpreter, e.g., `python -m llm.preprocess.download --help`.
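For scripted pipelines, the same `python -m` invocation can be assembled from Python. The sketch below is illustrative only: `module_command` is a hypothetical helper, not part of the `llm` package, and the `subprocess` call is left commented out because it needs the package to be installed.

```python
import shlex
import sys


def module_command(module: str, *args: str) -> list[str]:
    """Build the argv list that runs `module` with `python -m`."""
    # sys.executable is the interpreter running this script, so the CLI
    # resolves against the same environment.
    return [sys.executable, "-m", module, *args]


# Ask the downloader CLI for its help text.
help_cmd = module_command("llm.preprocess.download", "--help")
print(shlex.join(help_cmd))

# To execute it for real (requires the llm package to be importable):
#   import subprocess
#   subprocess.run(help_cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting issues when paths contain spaces.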
Note
This list is not exhaustive. In particular, the training scripts
provided in the llm.trainers module are not listed here.
llm.preprocess.download
Pretraining text downloader.
Usage:
Options:
| Name | Type | Description | Default | 
|---|---|---|---|
| -d,--dataset | choice (wikipedia \| bookscorpus) | Dataset to download. | Sentinel.UNSET | 
| -o,--output-dir | text | Output directory. | Sentinel.UNSET | 
| --log-level | choice (DEBUG \| INFO \| WARNING \| ERROR) | Minimum logging level. | INFO | 
| --rich/--no-rich | boolean | Use rich output formatting. | False | 
| --help | boolean | Show this message and exit. | False | 
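As a rough illustration of how the options in this table combine, the sketch below builds a download command and rejects dataset names outside the listed choices. `download_command` and the `data/wikipedia` output path are hypothetical and exist only for this example; the command is printed, not executed.

```python
import sys

# Choices for -d/--dataset, taken from the options table.
DATASETS = ("wikipedia", "bookscorpus")


def download_command(dataset: str, output_dir: str, log_level: str = "INFO") -> list[str]:
    """Build the argv list for llm.preprocess.download."""
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {DATASETS}, got {dataset!r}")
    return [
        sys.executable, "-m", "llm.preprocess.download",
        "--dataset", dataset,
        "--output-dir", output_dir,
        "--log-level", log_level,
    ]


cmd = download_command("wikipedia", "data/wikipedia")
print(" ".join(cmd))
```

Checking the dataset name up front fails fast in the calling script instead of after the interpreter has started the CLI.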
llm.preprocess.roberta
Encode FILEPATHS for RoBERTa pretraining.
Usage:
Options:
| Name | Type | Description | Default | 
|---|---|---|---|
| -o,--output-dir | text | Output directory for encoded shards. | Sentinel.UNSET | 
| -t,--tokenizer | text | Path to trained tokenizer to load. | Sentinel.UNSET | 
| -l,--max-seq-len | integer | Maximum sequence length. | 512 | 
| -s,--short-seq-prob | float | Probability of creating shorter sequences. | 0.1 | 
| -p,--processes | integer | Number of processes for concurrent shard encoding. | 4 | 
| --log-level | choice (DEBUG \| INFO \| WARNING \| ERROR) | Minimum logging level. | INFO | 
| --rich/--no-rich | boolean | Use rich output formatting. | False | 
| --help | boolean | Show this message and exit. | False | 
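A minimal sketch of an encoding invocation, using the defaults from the table. It assumes FILEPATHS are passed as positional arguments after the options; `roberta_command` and all file names are hypothetical, and nothing is executed.

```python
import sys


def roberta_command(filepaths, output_dir, tokenizer,
                    max_seq_len=512, short_seq_prob=0.1, processes=4):
    """Build the argv list for llm.preprocess.roberta."""
    # --short-seq-prob is a probability, so it must lie in [0, 1].
    if not 0.0 <= short_seq_prob <= 1.0:
        raise ValueError("short_seq_prob must be in [0, 1]")
    return [
        sys.executable, "-m", "llm.preprocess.roberta",
        "--output-dir", output_dir,
        "--tokenizer", tokenizer,
        "--max-seq-len", str(max_seq_len),
        "--short-seq-prob", str(short_seq_prob),
        "--processes", str(processes),
        # Assumption: FILEPATHS are positional and follow the options.
        *filepaths,
    ]


cmd = roberta_command(
    ["shards/shard-0.txt", "shards/shard-1.txt"],  # hypothetical inputs
    output_dir="encoded",
    tokenizer="tokenizer.json",
)
print(" ".join(cmd))
```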
llm.preprocess.shard
Shard documents in FILEPATHS into equally sized files.
Usage:
Options:
| Name | Type | Description | Default | 
|---|---|---|---|
| -o,--output-dir | text | Output directory for encoded shards. | Sentinel.UNSET | 
| -s,--size | text | Max data size of each shard. | Sentinel.UNSET | 
| -f,--format | text | Shard name format where {index} is replaced by shard index. | shard-{index}.txt | 
| --shuffle/--no-shuffle | boolean | Shuffle documents before sharding. | False | 
| --log-level | choice (DEBUG \| INFO \| WARNING \| ERROR) | Minimum logging level. | INFO | 
| --rich/--no-rich | boolean | Use rich output formatting. | False | 
| --help | boolean | Show this message and exit. | False | 
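To make the `--format` option concrete, the sketch below previews shard names by substituting `{index}` and then assembles a sharding command. Several things here are assumptions: that the index is inserted without zero-padding, that FILEPATHS are positional arguments after the options, and the helper and file names themselves. The `--size` value is left as a placeholder because the accepted size syntax is not stated in the table.

```python
import sys

DEFAULT_FORMAT = "shard-{index}.txt"  # default from the options table
SIZE_PLACEHOLDER = "<max-shard-size>"  # replace with a real size value


def shard_name(index: int, name_format: str = DEFAULT_FORMAT) -> str:
    """Preview a shard file name (assumes plain, unpadded substitution)."""
    return name_format.replace("{index}", str(index))


def shard_command(filepaths, output_dir, size,
                  name_format=DEFAULT_FORMAT, shuffle=False):
    """Build the argv list for llm.preprocess.shard."""
    return [
        sys.executable, "-m", "llm.preprocess.shard",
        "--output-dir", output_dir,
        "--size", size,
        "--format", name_format,
        # Boolean flag pair: pass exactly one of the two spellings.
        "--shuffle" if shuffle else "--no-shuffle",
        # Assumption: FILEPATHS are positional and follow the options.
        *filepaths,
    ]


print([shard_name(i) for i in range(3)])
cmd = shard_command(["corpus/wiki.txt"], "shards", SIZE_PLACEHOLDER, shuffle=True)
print(" ".join(cmd))
```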
llm.preprocess.tokenizer
Train a tokenizer on FILEPATHS.
The default arguments correspond to the standard uncased BERT tokenizer, which uses the wordpiece method.
Usage:
Options:
| Name | Type | Description | Default | 
|---|---|---|---|
| -o,--output-file | text | Output file to save serialized tokenizer to. | Sentinel.UNSET | 
| -s,--size | integer | Size of vocabulary. | 30522 | 
| -t,--tokenizer | choice (bpe \| wordpiece) | Tokenizer type. | wordpiece | 
| --cased/--uncased | boolean | Vocab/tokenizer is case-sensitive. | False | 
| -s,--special-token | text | Special tokens to prepend to vocab. | ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'] | 
| --log-level | choice (DEBUG \| INFO \| WARNING \| ERROR) | Minimum logging level. | INFO | 
| --rich/--no-rich | boolean | Use rich output formatting. | False | 
| --help | boolean | Show this message and exit. | False |
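A hedged sketch of a tokenizer training command with the table's defaults spelled out. `tokenizer_command` and the file names are hypothetical, FILEPATHS are assumed to be positional arguments after the options, and the command is only printed.

```python
import sys

TOKENIZER_TYPES = ("bpe", "wordpiece")  # choices for -t/--tokenizer


def tokenizer_command(filepaths, output_file, size=30522,
                      tokenizer="wordpiece", cased=False):
    """Build the argv list for llm.preprocess.tokenizer."""
    if tokenizer not in TOKENIZER_TYPES:
        raise ValueError(f"tokenizer must be one of {TOKENIZER_TYPES}")
    return [
        sys.executable, "-m", "llm.preprocess.tokenizer",
        "--output-file", output_file,
        "--size", str(size),
        "--tokenizer", tokenizer,
        # Boolean flag pair: --cased keeps case, --uncased is the default.
        "--cased" if cased else "--uncased",
        # Assumption: FILEPATHS are positional and follow the options.
        *filepaths,
    ]


cmd = tokenizer_command(["corpus/wiki.txt"], "tokenizer.json")
print(" ".join(cmd))
```

The long option names are used deliberately, since the table lists the short form `-s` for both `--size` and `--special-token`.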