espnet3.systems.asr.tokenizers.sentencepiece.prepare_sentences
espnet3.systems.asr.tokenizers.sentencepiece.prepare_sentences(dump_text_paths: List[str | Path], output_path: str | Path, remove_characters: str = '')
Create a SentencePiece training text file from dump text files.
This function consolidates multiple text files into a single train.txt file, which is formatted for use in SentencePiece training. It also provides an option to remove specified characters from the text before writing to the output file.
- Parameters:
  - dump_text_paths (List[Union[str, Path]]) – Paths to dump text files that will be concatenated. Each line is expected to be `<utt_id><space><text>`; the text after the first space is used for training.
  - output_path (Union[str, Path]) – The directory where the train.txt file will be saved. If the directory does not exist, it will be created.
  - remove_characters (str, optional) – A string of characters to remove from the text. Defaults to an empty string, meaning no characters are removed.
- Returns: None
- Raises:
- FileNotFoundError – If any of the dump text files do not exist.
  - IOError – If there is an error reading from the dump text files or writing to the output path.
Example
>>> prepare_sentences(
... dump_text_paths=["dump/train.txt"],
... output_path="tokenizer",
... remove_characters=",.!",
... ) # writes tokenizer/train.txt
Notes
- This helper writes plain text (one sentence per line) for use with train_sentencepiece().
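The behavior documented above can be sketched as follows. This is a minimal illustration under the stated contract (strip the `<utt_id>` before the first space, optionally delete characters, write one sentence per line to `train.txt`), not the library's actual implementation; the function name `prepare_sentences_sketch` is hypothetical.

```python
from pathlib import Path
from typing import List, Union


def prepare_sentences_sketch(
    dump_text_paths: List[Union[str, Path]],
    output_path: Union[str, Path],
    remove_characters: str = "",
) -> None:
    """Sketch of the documented behavior (not the espnet3 source)."""
    output_path = Path(output_path)
    output_path.mkdir(parents=True, exist_ok=True)  # create dir if missing
    # Translation table that deletes every character in remove_characters.
    table = str.maketrans("", "", remove_characters)
    with open(output_path / "train.txt", "w", encoding="utf-8") as out:
        for dump in dump_text_paths:
            # open() raises FileNotFoundError if a dump file is missing.
            with open(dump, encoding="utf-8") as f:
                for line in f:
                    # Drop the <utt_id> before the first space; keep the text.
                    parts = line.rstrip("\n").split(" ", 1)
                    text = parts[1] if len(parts) > 1 else ""
                    out.write(text.translate(table) + "\n")
```

With a dump line `utt1 hello, world!` and `remove_characters=",.!"`, the resulting `train.txt` line would be `hello world`, matching the example above.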
