llm.preprocess.format
Pretraining text formatting utilities.
combine_document_files
Combine multiple text files into one.
Source code in llm/preprocess/format.py
get_sent_tokenizer
Get a sentence tokenizer.
Returns:
Source code in llm/preprocess/format.py
read_documents_bytes
Read documents from files.
Parameters:
- files (Iterable[Path | str] | Path | str) – File or list of files to read, each containing documents separated by blank lines.
Returns:
Source code in llm/preprocess/format.py
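The input format described above (documents separated by blank lines, read as bytes) can be illustrated with a minimal sketch. The helper name `read_blank_line_documents` and its `list[bytes]` return type are assumptions made for illustration; this is not the implementation of `read_documents_bytes`, whose return type is not shown here.

```python
from pathlib import Path
from typing import Iterable, List, Union

PathLike = Union[Path, str]


def read_blank_line_documents(
    files: Union[Iterable[PathLike], PathLike],
) -> List[bytes]:
    """Hypothetical helper, not the library's read_documents_bytes."""
    # Accept a single path as well as an iterable of paths.
    if isinstance(files, (str, Path)):
        files = [files]

    documents: List[bytes] = []
    for file in files:
        current: List[bytes] = []  # lines of the document being built
        with open(file, "rb") as f:
            for line in f:
                if line.strip():
                    # Non-blank line: part of the current document.
                    current.append(line.rstrip(b"\r\n"))
                elif current:
                    # Blank line: close the current document, if any.
                    documents.append(b"\n".join(current))
                    current = []
        # A file boundary also ends the current document.
        if current:
            documents.append(b"\n".join(current))
    return documents
```

In this sketch, consecutive blank lines are treated as a single separator, and a document never spans two files. Whether the library makes the same choices is not stated in this page.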
write_documents
Write a list of documents to a file.
Parameters:
- path (Path | str) – Path to write documents to.
- documents (list[str]) – Documents to write. Each document will be separated by a blank line.
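As a rough illustration of the output format (one blank line between documents), here is a minimal sketch. The function name `write_blank_line_documents`, the UTF-8 encoding, the stripping of surrounding whitespace, and the trailing newline are all assumptions; the actual `write_documents` may differ on each point.

```python
from pathlib import Path
from typing import List, Union


def write_blank_line_documents(
    path: Union[Path, str],
    documents: List[str],
) -> None:
    """Hypothetical helper, not the library's write_documents."""
    path = Path(path)
    # Assumption: create missing parent directories for convenience.
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        # "\n\n" between documents yields exactly one blank line.
        f.write("\n\n".join(doc.strip() for doc in documents))
        f.write("\n")
```

A call such as `write_blank_line_documents("out/corpus.txt", ["first doc", "second doc"])` would produce a file with the two documents separated by a single blank line. Note that a document containing its own blank lines would be indistinguishable from several documents in this format.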