llm.preprocess.format
Pretraining text formatting utilities.
combine_document_files
Combine multiple text files into one.
Source code in llm/preprocess/format.py
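The signature of combine_document_files is not shown on this page, so the following is only a minimal sketch of the idea. The helper name combine_files_sketch, its parameters, and the choice to separate the contents of consecutive files with a blank line are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def combine_files_sketch(inputs, output):
    """Hypothetical stand-in: concatenate several text files into one.

    Assumption: the contents of consecutive input files are separated
    by a single blank line in the output file.
    """
    # Read every input file and drop leading/trailing whitespace.
    parts = [Path(p).read_text(encoding="utf-8").strip() for p in inputs]
    # Join with a blank line and end the file with a newline.
    Path(output).write_text("\n\n".join(parts) + "\n", encoding="utf-8")


# Small demonstration using temporary files.
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.txt"
    b = Path(tmp) / "b.txt"
    a.write_text("first file\n", encoding="utf-8")
    b.write_text("second file\n", encoding="utf-8")
    merged = Path(tmp) / "merged.txt"
    combine_files_sketch([a, b], merged)
    combined_text = merged.read_text(encoding="utf-8")
```

The real function may handle separators, encodings, or output paths differently; consult the source in llm/preprocess/format.py for the actual behavior.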
get_sent_tokenizer
Get a sentence tokenizer.
Returns:
Source code in llm/preprocess/format.py
read_documents_bytes
Read documents from files.
Parameters:
- files (Iterable[Path | str] | Path | str) – File or list of files to read. Each file contains documents separated by blank lines.
Returns:
Source code in llm/preprocess/format.py
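The return type is not listed above, so the sketch below is an illustration rather than the library's implementation. It assumes the result is a list of bytes objects, one per document, and that any run of blank lines acts as a document separator. The name read_documents_bytes_sketch is hypothetical.

```python
import re
import tempfile
from pathlib import Path


def read_documents_bytes_sketch(files):
    """Hypothetical stand-in: read blank-line-separated documents as bytes.

    Assumption: returns a flat list with one bytes object per document.
    """
    # Accept a single path as well as an iterable of paths.
    if isinstance(files, (str, Path)):
        files = [files]
    documents = []
    for path in files:
        data = Path(path).read_bytes()
        # A newline, optional whitespace, and another newline is a blank line.
        for chunk in re.split(rb"\n\s*\n", data):
            chunk = chunk.strip()
            if chunk:  # skip empty pieces from leading/trailing blank lines
                documents.append(chunk)
    return documents


# Small demonstration using a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "docs.txt"
    sample.write_bytes(b"doc one line 1\ndoc one line 2\n\ndoc two\n")
    docs = read_documents_bytes_sketch(sample)
```

Reading as bytes defers decoding to the caller, which can help when input files are large or have mixed encodings; whether that is the motivation here is not stated in the source.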
write_documents
Write a list of documents to a file.
Parameters:
- path (Path | str) – Path to write documents to.
- documents (list[str]) – Documents to write. Each document will be separated by a blank line.
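The output format described by these parameters can be sketched as follows. This is an illustrative stand-in under stated assumptions, not the library's code: the name write_documents_sketch is hypothetical, and the UTF-8 encoding and trailing newline are choices made for the example.

```python
import tempfile
from pathlib import Path


def write_documents_sketch(path, documents):
    """Hypothetical stand-in: write documents separated by blank lines.

    Assumptions: UTF-8 output and a single trailing newline.
    """
    with open(path, "w", encoding="utf-8") as f:
        # Strip each document so exactly one blank line separates neighbors.
        f.write("\n\n".join(doc.strip() for doc in documents))
        f.write("\n")


# Small demonstration using a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "out.txt"
    write_documents_sketch(target, ["alpha\nbeta", "gamma"])
    written = target.read_text(encoding="utf-8")
```

A file written this way uses the same blank-line convention that read_documents_bytes expects for its input files.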