espnet2.train.preprocessor.SVSPreprocessor
class espnet2.train.preprocessor.SVSPreprocessor(train: bool, token_type: str | None = None, token_list: Path | str | Iterable[str] | None = None, bpemodel: Path | str | Iterable[str] | None = None, text_cleaner: Collection[str] | None = None, g2p_type: str | None = None, unk_symbol: str = '<unk>', space_symbol: str = '<space>', non_linguistic_symbols: Path | str | Iterable[str] | None = None, delimiter: str | None = None, singing_volume_normalize: float | None = None, singing_name: str = 'singing', text_name: str = 'text', label_name: str = 'label', midi_name: str = 'score', fs: int32 = 0, hop_length: int32 = 256, phn_seg: dict = {1: [1], 2: [0.25, 1], 3: [0.1, 0.5, 1], 4: [0.05, 0.1, 0.5, 1]})
Bases: AbsPreprocessor
Preprocessor for the Singing Voice Synthesis (SVS) task.
This class handles the preprocessing steps for data used in singing voice synthesis tasks, including text cleaning, tokenization, and normalization of singing audio signals.
train
Indicates whether the preprocessor is in training mode.
- Type: bool
singing_name
Key in the data dictionary for the singing audio.
- Type: str
text_name
Key in the data dictionary for the text input.
- Type: str
label_name
Key in the data dictionary for the label information.
- Type: str
midi_name
Key in the data dictionary for the MIDI score.
- Type: str
fs
Sampling rate of the audio.
- Type: np.int32
hop_length
Hop length for audio processing.
- Type: np.int32
singing_volume_normalize
Factor for normalizing singing volume.
- Type: float
phn_seg
Segmentation rules for phonemes.
- Type: dict
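As a rough illustration of how `singing_volume_normalize`, `fs`, and `hop_length` might be used, the sketch below peak-normalizes a waveform and converts a duration in seconds to a frame count. The helpers `peak_normalize` and `seconds_to_frames` are hypothetical and not part of ESPnet. The formulas (scaling by the target divided by the peak absolute amplitude, and rounding the duration divided by `hop_length / fs`) are assumptions about typical behavior, not a statement of the library's implementation.

```python
import numpy as np


def peak_normalize(singing: np.ndarray, target: float) -> np.ndarray:
    """Scale a waveform so its peak absolute amplitude equals `target`.

    Hypothetical helper, shown only to illustrate the idea.
    """
    peak = np.max(np.abs(singing))
    if peak == 0:
        return singing  # silent input: nothing to scale
    return singing * target / peak


def seconds_to_frames(duration: float, fs: int, hop_length: int) -> int:
    """Convert a duration in seconds to a rounded frame count.

    Hypothetical helper; assumes one frame every `hop_length` samples.
    """
    if fs <= 0:
        raise ValueError("fs must be a positive sampling rate")
    time_shift = hop_length / fs  # seconds covered by one frame
    return int(duration / time_shift + 0.5)


wave = np.array([0.1, -0.5, 0.25])
normalized = peak_normalize(wave, target=1.0)
frames = seconds_to_frames(1.0, fs=24000, hop_length=256)
```

Because the signature defaults `fs` to 0, a conversion like this is undefined unless the actual sampling rate of the audio is passed.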
Parameters:
- train (bool) – Whether to use in training mode.
- token_type (Optional[str]) – Type of tokenizer to use.
- token_list (Union[Path, str, Iterable[str]]) – Path or list of tokens.
- bpemodel (Union[Path, str, Iterable[str]]) – BPE model for tokenization.
- text_cleaner (Collection[str]) – List of cleaning functions for text.
- g2p_type (Optional[str]) – Type of grapheme-to-phoneme conversion.
- unk_symbol (str) – Symbol used for unknown tokens.
- space_symbol (str) – Symbol used for spaces in text.
- non_linguistic_symbols (Union[Path, str, Iterable[str]]) – Path or list of non-linguistic symbols to handle.
- delimiter (Optional[str]) – Delimiter for separating tokens in text.
- singing_volume_normalize (float) – Normalization factor for singing audio.
- singing_name (str) – Key for singing audio in data dictionary.
- text_name (str) – Key for text in data dictionary.
- label_name (str) – Key for labels in data dictionary.
- midi_name (str) – Key for MIDI score in data dictionary.
- fs (np.int32) – Sampling rate for audio.
- hop_length (np.int32) – Hop length for audio processing.
- phn_seg (dict) – Segmentation rules for phonemes.
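The default `phn_seg` maps a phoneme count to a list of increasing fractions ending at 1. One plausible reading, assumed here and not confirmed by this page, is that each value is the cumulative fraction of a note's duration at which the corresponding phoneme ends. The function `split_note` below is a hypothetical helper that illustrates that reading; it is not an ESPnet API.

```python
# Default table copied from the class signature.
PHN_SEG = {
    1: [1],
    2: [0.25, 1],
    3: [0.1, 0.5, 1],
    4: [0.05, 0.1, 0.5, 1],
}


def split_note(duration: float, n_phonemes: int, phn_seg=PHN_SEG) -> list:
    """Split a note's duration among its phonemes.

    Assumes each entry in phn_seg[n_phonemes] is a cumulative end fraction.
    """
    durations = []
    previous = 0.0
    for bound in phn_seg[n_phonemes]:
        # Each phoneme gets the span between consecutive boundaries.
        durations.append((bound - previous) * duration)
        previous = bound
    return durations


parts = split_note(2.0, 3)
```

Under this assumption, a 2-second note with three phonemes is divided into spans of roughly 0.2, 0.8, and 1.0 seconds, so the first (typically consonant) phoneme stays short and the spans always sum to the note's duration.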
Examples
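>>> import numpy as np
>>> from espnet2.train.preprocessor import SVSPreprocessor
>>> preprocessor = SVSPreprocessor(train=True, token_type="bpe",
...                                token_list=['<unk>', '<space>'],
...                                bpemodel="path/to/bpe/model",
...                                fs=24000)  # illustrative sampling rate
>>> data = {
...     "singing": np.random.rand(1000),  # Example singing audio
...     # One (start, end) row per phoneme, matching the phoneme list
...     "label": (np.array([[0.0, 0.5], [0.5, 1.0]]), ["a", "b"]),
...     # (tempo, [[start, end, syllable, MIDI note, "_"-joined phonemes]])
...     "score": (120, [[0, 1, "A", 60, "a_b"]]),
... }
>>> processed_data = preprocessor("example_uid", data)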