espnet3.systems.asr.tokenizer.sentencepiece.prepare_sentences
espnet3.systems.asr.tokenizer.sentencepiece.prepare_sentences(dump_text_paths: List[str | Path], output_path: str | Path, remove_characters: str = '')
Create a training text file for SentencePiece model training from the provided dump text files.
This function consolidates multiple text files into a single train.txt file, which is formatted for use in SentencePiece training. It also provides an option to remove specified characters from the text before writing to the output file.
- Parameters:
- dump_text_paths (Union[str, Path] or List[Union[str, Path]]) – A single dump text file path or a list of paths to the dump text files that will be processed.
- output_path (Union[str, Path]) – The directory where the train.txt file will be saved. If the directory does not exist, it will be created.
- remove_characters (str, optional) – A string containing characters to be removed from the text. Defaults to an empty string, meaning no characters will be removed.
- Raises:
- FileNotFoundError – If any of the dump text files do not exist.
- IOError – If there is an error reading from the dump text files or writing to the output path.
Examples
>>> prepare_sentences("data/dump.txt", "output", remove_characters=",.!")
This will create an `output/train.txt` file from `data/dump.txt`,
removing commas, periods, and exclamation marks from the text.

>>> prepare_sentences(["data/dump1.txt", "data/dump2.txt"], "output")
This will create an `output/train.txt` file by concatenating
`data/dump1.txt` and `data/dump2.txt` without removing any characters.

NOTE
Ensure that the input dump text files are properly formatted, as the function expects each line to have a space-separated format where the text to be processed is after the first space.
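The sketch below ties the pieces together: it builds `output/train.txt` with `prepare_sentences` and then trains a SentencePiece model on it. The vocabulary size, model type, and file paths are illustrative assumptions, not values defined by this API.

```python
import sentencepiece as spm

from espnet3.systems.asr.tokenizer.sentencepiece import prepare_sentences

# Each dump file is expected to contain lines such as
#   utt_0001 HELLO WORLD
# where everything after the first space is the text that is kept.
prepare_sentences(
    ["data/dump1.txt", "data/dump2.txt"],  # illustrative paths
    "output",
    remove_characters=",.!",
)

# The resulting output/train.txt (one sentence per line) can be passed
# directly to SentencePiece training. vocab_size and model_type are
# example values, not project defaults.
spm.SentencePieceTrainer.train(
    input="output/train.txt",
    model_prefix="output/bpe",
    vocab_size=2000,
    model_type="bpe",
    character_coverage=1.0,
)
```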
