espnet2.train.preprocessor.SVSPreprocessor

class espnet2.train.preprocessor.SVSPreprocessor(train: bool, token_type: str | None = None, token_list: Path | str | Iterable[str] | None = None, bpemodel: Path | str | Iterable[str] | None = None, text_cleaner: Collection[str] | None = None, g2p_type: str | None = None, unk_symbol: str = '<unk>', space_symbol: str = '<space>', non_linguistic_symbols: Path | str | Iterable[str] | None = None, delimiter: str | None = None, singing_volume_normalize: float | None = None, singing_name: str = 'singing', text_name: str = 'text', label_name: str = 'label', midi_name: str = 'score', fs: int32 = 0, hop_length: int32 = 256, phn_seg: dict = {1: [1], 2: [0.25, 1], 3: [0.1, 0.5, 1], 4: [0.05, 0.1, 0.5, 1]})

Bases: AbsPreprocessor

Preprocessor for the Singing Voice Synthesis (SVS) task.

This class handles the preprocessing steps for data used in singing voice synthesis tasks, including text cleaning, tokenization, and normalization of singing audio signals.

train

Indicates whether the preprocessor is in training mode.

Type: bool

singing_name

Key in the data dictionary for the singing audio.

Type: str

text_name

Key in the data dictionary for the text input.

Type: str

label_name

Key in the data dictionary for the label information.

Type: str

midi_name

Key in the data dictionary for the MIDI score.

Type: str

fs

Sampling rate of the audio.

Type: np.int32

hop_length

Hop length for audio processing.

Type: np.int32

singing_volume_normalize

Factor for normalizing singing volume.

Type: float

phn_seg

Segmentation rules for phonemes.

Type: dict
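
How `phn_seg`, `fs`, and `hop_length` might interact can be illustrated with a small sketch. It assumes (this is not stated above) that each key of `phn_seg` is the number of phonemes in a syllable, that the values are cumulative fractions of the note's duration, and that one frame spans `hop_length / fs` seconds. The helper `rule_based_frames` is hypothetical and is not part of ESPnet.

```python
# Default segmentation table, copied from the class signature.
PHN_SEG = {1: [1], 2: [0.25, 1], 3: [0.1, 0.5, 1], 4: [0.05, 0.1, 0.5, 1]}


def rule_based_frames(start, end, phonemes, fs, hop_length=256, phn_seg=PHN_SEG):
    """Split one note into per-phoneme frame counts (illustrative helper)."""
    time_shift = hop_length / fs  # assumed duration of one frame, in seconds
    bounds = phn_seg[len(phonemes)]  # assumed cumulative fractions of the note
    frames, previous = [], 0.0
    for bound in bounds:
        seconds = (bound - previous) * (end - start)
        frames.append(int(seconds / time_shift + 0.5))  # round to nearest frame
        previous = bound
    return frames


# A one-second note sung as "a_b" at 24 kHz with the default hop length.
print(rule_based_frames(0.0, 1.0, ["a", "b"], fs=24000))
```

Under these assumptions, the first phoneme of a two-phoneme syllable receives a quarter of the note and the second receives the rest. Note that `fs` must be non-zero for the frame duration to be defined.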

Parameters:
- train (bool) – Whether to use in training mode.
- token_type (Optional[str]) – Type of tokenizer to use.
- token_list (Union[Path, str, Iterable[str]]) – Path or list of tokens.
- bpemodel (Union[Path, str, Iterable[str]]) – BPE model for tokenization.
- text_cleaner (Collection[str]) – List of cleaning functions for text.
- g2p_type (Optional[str]) – Type of grapheme-to-phoneme conversion.
- unk_symbol (str) – Symbol used for unknown tokens.
- space_symbol (str) – Symbol used for spaces in text.
- non_linguistic_symbols (Union[Path, str, Iterable[str]]) – Path or list of non-linguistic symbols to handle.
- delimiter (Optional[str]) – Delimiter for separating tokens in text.
- singing_volume_normalize (float) – Normalization factor for singing audio.
- singing_name (str) – Key for singing audio in data dictionary.
- text_name (str) – Key for text in data dictionary.
- label_name (str) – Key for labels in data dictionary.
- midi_name (str) – Key for MIDI score in data dictionary.
- fs (np.int32) – Sampling rate for audio.
- hop_length (np.int32) – Hop length for audio processing.
- phn_seg (dict) – Segmentation rules for phonemes.
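
The effect of `singing_volume_normalize` can be pictured with a short sketch. It assumes peak normalization, where the waveform is rescaled so that its largest absolute sample equals the given factor. The text above does not state the exact formula, and `normalize_peak` is a hypothetical helper, not an ESPnet function.

```python
import numpy as np


def normalize_peak(singing, target):
    """Rescale a waveform so its peak magnitude equals `target` (assumed behavior)."""
    peak = np.max(np.abs(singing))
    if peak == 0:  # silent input: nothing to rescale
        return singing
    return singing * target / peak


wave = np.array([0.1, -0.5, 0.25])
scaled = normalize_peak(wave, target=1.0)
print(np.max(np.abs(scaled)))
```

The guard for silent input is a design choice of this sketch, added to avoid dividing by zero.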

Examples

>>> import numpy as np
>>> from espnet2.train.preprocessor import SVSPreprocessor
>>> preprocessor = SVSPreprocessor(train=True, token_type="bpe",
...                                 token_list=['<unk>', '<space>'],
...                                 bpemodel="path/to/bpe/model")
>>> data = {
...     "singing": np.random.rand(1000),  # Example singing audio
...     "label": (np.array([[0, 1]]), ["a", "b", "c"]),
...     "score": (120, [[0, 1, "A", 60, "a_b"]])
... }
>>> processed_data = preprocessor("example_uid", data)
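
Before calling the preprocessor, it can help to check that the label and score entries agree. The sketch below assumes a layout inferred from the example data: the label is a pair of per-phoneme `(start, end)` times and a phoneme list, and each score note is `[start, end, syllable, midi_pitch, phonemes]` with phonemes joined by `_`. Both this layout and the helper `check_alignment` are assumptions made for illustration, not part of ESPnet.

```python
def check_alignment(label, score):
    """Return a list of problems found between label and score (illustrative)."""
    times, phonemes = label
    _tempo, notes = score
    problems = []
    if len(times) != len(phonemes):
        problems.append(
            f"label has {len(times)} time spans but {len(phonemes)} phonemes"
        )
    # Flatten the phonemes written in the score, note by note.
    score_phonemes = [p for note in notes for p in note[4].split("_")]
    if score_phonemes != list(phonemes):
        problems.append(
            f"score phonemes {score_phonemes} differ from label phonemes {list(phonemes)}"
        )
    return problems


label = ([[0.0, 0.25], [0.25, 1.0]], ["a", "b"])
score = (120, [[0.0, 1.0, "A", 60, "a_b"]])
print(check_alignment(label, score))  # an empty list means no problems were found
```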