llm.preprocess.format
Pretraining text formatting utilities.
combine_document_files()
combine_document_files(
filepaths: Iterable[pathlib.Path | str],
output_file: pathlib.Path | str,
) -> None
Combine multiple text files into one.
Source code in llm/preprocess/format.py
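The source of combine_document_files() is not reproduced on this page, so the following is only a minimal sketch of what combining text files could look like. The name combine_files_sketch, the UTF-8 encoding, the creation of the output directory, and the trailing-newline handling are all assumptions made for illustration, not the library's actual behavior.

```python
from __future__ import annotations

import pathlib
from typing import Iterable


def combine_files_sketch(
    filepaths: Iterable[pathlib.Path | str],
    output_file: pathlib.Path | str,
) -> None:
    """Hypothetical stand-in: concatenate text files into one output file."""
    output_file = pathlib.Path(output_file)
    # Assumption: create the output directory if it does not exist yet.
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with open(output_file, 'w', encoding='utf-8') as out:
        for filepath in filepaths:
            with open(filepath, encoding='utf-8') as f:
                text = f.read()
            out.write(text)
            # Assumption: terminate each input with a newline so the last
            # line of one file does not run into the first line of the next.
            if text and not text.endswith('\n'):
                out.write('\n')
```

The signature mirrors the one documented above, so paths may be passed as either pathlib.Path objects or strings.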
get_sent_tokenizer()
Get a sentence tokenizer.
Returns:
read_documents_bytes()
read_documents_bytes(
files: (
Iterable[pathlib.Path | str] | pathlib.Path | str
),
) -> list[bytes]
Read documents from files.
Parameters:
- files (Iterable[Path | str] | Path | str) – File or list of files to read. Each file contains documents separated by blank lines.
Returns:
- (list[bytes]) – Documents read from the files, as bytes.
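To illustrate the blank-line document format described above, here is a hedged sketch of a reader. It is not the library's implementation: the name read_documents_bytes_sketch, the stripping of surrounding whitespace on each line, and the choice to rejoin a document's lines with a single newline are assumptions.

```python
from __future__ import annotations

import pathlib
from typing import Iterable


def read_documents_bytes_sketch(
    files: Iterable[pathlib.Path | str] | pathlib.Path | str,
) -> list[bytes]:
    """Hypothetical stand-in: split files into blank-line-separated documents."""
    # Accept a single path as well as an iterable of paths.
    if isinstance(files, (str, pathlib.Path)):
        files = [files]

    documents: list[bytes] = []
    for file in files:
        current: list[bytes] = []
        with open(file, 'rb') as f:
            for line in f:
                stripped = line.strip()
                if stripped:
                    # Non-blank line: it belongs to the current document.
                    current.append(stripped)
                elif current:
                    # Blank line: close the current document, if any.
                    documents.append(b'\n'.join(current))
                    current = []
        # Assumption: a document never spans two files, so flush at EOF.
        if current:
            documents.append(b'\n'.join(current))
    return documents
```

Reading in binary mode keeps the result as bytes, matching the documented return type, and consecutive blank lines are treated as a single separator in this sketch.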
write_documents()
Write a list of documents to a file.
Parameters:
- path (Path | str) – Path to write documents to.
- documents (list[str]) – Documents to write. Each document will be separated by a blank line.
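The output layout described above (one blank line between documents) can be sketched as follows. This is an illustrative assumption, not the library's code: the name write_documents_sketch, the UTF-8 encoding, the stripping of each document, and the single trailing newline are all choices made for the example.

```python
from __future__ import annotations

import pathlib


def write_documents_sketch(
    path: pathlib.Path | str,
    documents: list[str],
) -> None:
    """Hypothetical stand-in: write documents separated by blank lines."""
    path = pathlib.Path(path)
    # Assumption: create the parent directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, 'w', encoding='utf-8') as f:
        # Two newlines between documents yield exactly one blank line.
        f.write('\n\n'.join(doc.strip() for doc in documents))
        # Assumption: end the file with a single newline.
        f.write('\n')
```

A file written this way has the same blank-line-separated layout that read_documents_bytes() is documented to consume.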