espnet3.systems.asr.tokenizer.sentencepiece.prepare_sentences

espnet3.systems.asr.tokenizer.sentencepiece.prepare_sentences(dump_text_paths: List[str | Path], output_path: str | Path, remove_characters: str = '')

Create a training text file for SentencePiece model training from the provided dump text files.

This function consolidates multiple text files into a single train.txt file, which is formatted for use in SentencePiece training. It also provides an option to remove specified characters from the text before writing to the output file.

Parameters:
- dump_text_paths (Union[str, Path]) – A single dump text file path or a list of paths to the dump text files that will be processed.
- output_path (Union[str, Path]) – The directory where the train.txt file will be saved. If the directory does not exist, it will be created.
- remove_characters (str, optional) – A string containing characters to be removed from the text. Defaults to an empty string, meaning no characters will be removed.
Raises:
- FileNotFoundError – If any of the dump text files do not exist.
- IOError – If there is an error reading from the dump text files or writing to the output path.
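
Because a missing input file raises FileNotFoundError, it can be convenient to check the paths before calling the function. The helper below is a minimal sketch for that purpose; `find_missing_files` is a hypothetical name and is not part of ESPnet.

```python
from pathlib import Path
from typing import Iterable, List, Union


def find_missing_files(paths: Iterable[Union[str, Path]]) -> List[Path]:
    """Return every path that does not point to an existing regular file.

    Hypothetical pre-flight helper, not part of ESPnet.
    """
    return [Path(p) for p in paths if not Path(p).is_file()]


# Report missing inputs up front instead of failing partway through.
missing = find_missing_files(["data/dump1.txt", "data/dump2.txt"])
if missing:
    print("Missing dump text files:", ", ".join(str(p) for p in missing))
```

Checking up front reports all missing files at once, whereas the exception raised by the function is only guaranteed to mention that some file is missing.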

Examples

>>> prepare_sentences("data/dump.txt", "output", remove_characters=",.!")
This will create an `output/train.txt` file from `data/dump.txt`,
removing commas, periods, and exclamation marks from the text.

>>> prepare_sentences(["data/dump1.txt", "data/dump2.txt"], "output")
This will create an `output/train.txt` file by concatenating
`data/dump1.txt` and `data/dump2.txt` without removing any characters.

NOTE

Each line of the input dump text files is expected to be space-separated, with the text to be processed following the first space. Make sure the input files follow this format before calling the function.
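
The behavior described above can be approximated in plain Python. The following is a sketch under stated assumptions, not the ESPnet implementation: `sketch_prepare_sentences` is a hypothetical name, files are assumed to be UTF-8, each input line is assumed to yield one output line, and a line containing no space is assumed to contribute an empty line.

```python
from pathlib import Path
from typing import Iterable, Union


def sketch_prepare_sentences(
    dump_text_paths: Iterable[Union[str, Path]],
    output_path: Union[str, Path],
    remove_characters: str = "",
) -> Path:
    """Illustrative stand-in for the documented behavior (not ESPnet code)."""
    output_dir = Path(output_path)
    # Create the output directory if it does not exist yet.
    output_dir.mkdir(parents=True, exist_ok=True)

    # With a third argument, str.maketrans maps each listed character to None,
    # so translate() deletes those characters.
    table = str.maketrans("", "", remove_characters)

    train_file = output_dir / "train.txt"
    with train_file.open("w", encoding="utf-8") as out:
        for path in dump_text_paths:
            with Path(path).open(encoding="utf-8") as src:
                for line in src:
                    # Keep only the text after the first space.
                    parts = line.rstrip("\n").split(" ", maxsplit=1)
                    # Assumption: a line without a space has no text.
                    text = parts[1] if len(parts) == 2 else ""
                    out.write(text.translate(table) + "\n")
    return train_file
```

Under these assumptions, an input line such as `utt1 hello, world!` with `remove_characters=",.!"` would be written as `hello world`. The token `utt1` is only an illustrative placeholder for whatever precedes the first space.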