espnet3.systems.asr.tokenizers.sentencepiece.prepare_sentences
espnet3.systems.asr.tokenizers.sentencepiece.prepare_sentences(dump_text_paths: List[str | Path], output_path: str | Path, remove_characters: str = '')
Create a SentencePiece training text file from dump text files.
This function consolidates multiple text files into a single train.txt file, which is formatted for use in SentencePiece training. It also provides an option to remove specified characters from the text before writing to the output file.
- Parameters:
- dump_text_paths (Union[str, Path] or List[Union[str, Path]]) – A single dump text file path or a list of paths to the dump text files that will be processed.
- output_path (Union[str, Path]) – The directory where the train.txt file will be saved. If the directory does not exist, it will be created.
- remove_characters (str, optional) – A string containing characters to be removed from the text. Defaults to an empty string, meaning no characters will be removed.
- Raises:
- FileNotFoundError – If any of the dump text files do not exist.
- IOError – If there is an error reading from the dump text files or writing to the output path.
Examples
>>> prepare_sentences("data/dump.txt", "output", remove_characters=",.!")
This will create an ``output/train.txt`` file from ``data/dump.txt``,
removing commas, periods, and exclamation marks from the text.

>>> prepare_sentences(["data/dump1.txt", "data/dump2.txt"], "output")
This will create an ``output/train.txt`` file by concatenating
``data/dump1.txt`` and ``data/dump2.txt`` without removing any characters.

NOTE
Ensure that the input dump text files are properly formatted, as the function expects each line to have a space-separated format where the text to be processed is after the first space.
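The behavior described above can be approximated with a short standalone sketch. This is not the ESPnet implementation: `sketch_prepare_sentences` is a hypothetical name, and the details are assumptions made for illustration. It assumes the token before the first space is an utterance ID, that files are UTF-8 encoded, and that a line with no space produces an empty output line.

```python
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def sketch_prepare_sentences(
    dump_text_paths: Union[PathLike, List[PathLike]],
    output_path: PathLike,
    remove_characters: str = "",
) -> Path:
    """Hypothetical stand-in for prepare_sentences; behavior is assumed."""
    # Accept a single path as well as a list of paths.
    if isinstance(dump_text_paths, (str, Path)):
        dump_text_paths = [dump_text_paths]

    # Create the output directory if it does not exist yet.
    output_dir = Path(output_path)
    output_dir.mkdir(parents=True, exist_ok=True)

    # A translation table whose third argument lists characters to delete.
    table = str.maketrans("", "", remove_characters)

    train_file = output_dir / "train.txt"
    with train_file.open("w", encoding="utf-8") as out:
        for path in dump_text_paths:
            # Opening a missing file raises FileNotFoundError.
            with Path(path).open("r", encoding="utf-8") as src:
                for line in src:
                    # Keep only the text after the first space
                    # (assumed to follow an utterance ID).
                    _, _, text = line.rstrip("\n").partition(" ")
                    out.write(text.translate(table) + "\n")
    return train_file
```

Under these assumptions, a dump line such as `utt1 Hello, world!` with `remove_characters=",.!"` would be written to `train.txt` as `Hello world`.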
