espnet2.svs.espnet_model.ESPnetSVSModel
class espnet2.svs.espnet_model.ESPnetSVSModel(text_extract: AbsFeatsExtract | None, feats_extract: AbsFeatsExtract | None, score_feats_extract: AbsFeatsExtract | None, label_extract: AbsFeatsExtract | None, pitch_extract: AbsFeatsExtract | None, ying_extract: AbsFeatsExtract | None, duration_extract: AbsFeatsExtract | None, energy_extract: AbsFeatsExtract | None, normalize: InversibleInterface | None, pitch_normalize: InversibleInterface | None, energy_normalize: InversibleInterface | None, svs: AbsSVS)
Bases: AbsESPnetModel
ESPnet model for singing voice synthesis task.
This model is designed for singing voice synthesis using various feature extraction techniques. It processes text and singing waveforms to produce outputs for the synthesis task.
text_extract
Feature extractor for text.
- Type: Optional[AbsFeatsExtract]
feats_extract
Feature extractor for audio.
- Type: Optional[AbsFeatsExtract]
score_feats_extract
Feature extractor for score features.
- Type: Optional[AbsFeatsExtract]
label_extract
Feature extractor for labels.
- Type: Optional[AbsFeatsExtract]
pitch_extract
Feature extractor for pitch.
- Type: Optional[AbsFeatsExtract]
ying_extract
Feature extractor for ying.
- Type: Optional[AbsFeatsExtract]
duration_extract
Feature extractor for duration.
- Type: Optional[AbsFeatsExtract]
energy_extract
Feature extractor for energy.
- Type: Optional[AbsFeatsExtract]
normalize
Normalization layer for features.
- Type: Optional[AbsNormalize and InversibleInterface]
pitch_normalize
Normalization layer for pitch.
- Type: Optional[AbsNormalize and InversibleInterface]
energy_normalize
Normalization layer for energy.
- Type: Optional[AbsNormalize and InversibleInterface]
svs
The main singing voice synthesis model.
- Type: AbsSVS
Parameters:
- text_extract (Optional[AbsFeatsExtract]) – Feature extractor for text.
- feats_extract (Optional[AbsFeatsExtract]) – Feature extractor for audio.
- score_feats_extract (Optional[AbsFeatsExtract]) – Feature extractor for score features.
- label_extract (Optional[AbsFeatsExtract]) – Feature extractor for labels.
- pitch_extract (Optional[AbsFeatsExtract]) – Feature extractor for pitch.
- ying_extract (Optional[AbsFeatsExtract]) – Feature extractor for ying.
- duration_extract (Optional[AbsFeatsExtract]) – Feature extractor for duration.
- energy_extract (Optional[AbsFeatsExtract]) – Feature extractor for energy.
- normalize (Optional[AbsNormalize and InversibleInterface]) – Normalization layer for features.
- pitch_normalize (Optional[AbsNormalize and InversibleInterface]) – Normalization layer for pitch.
- energy_normalize (Optional[AbsNormalize and InversibleInterface]) – Normalization layer for energy.
- svs (AbsSVS) – The main singing voice synthesis model.
####### Examples
>>> model = ESPnetSVSModel(text_extract=text_feature_extractor,
... feats_extract=audio_feature_extractor,
... svs=svs_model)
>>> output = model(text_tensor, text_lengths_tensor, singing_tensor,
... singing_lengths_tensor)
- Raises: RuntimeError – If the score feature extractor type is not recognized.
Initialize ESPnetSVSModel module.
collect_feats(text: Tensor, text_lengths: Tensor, singing: Tensor, singing_lengths: Tensor, label: Tensor | None = None, label_lengths: Tensor | None = None, phn_cnt: Tensor | None = None, midi: Tensor | None = None, midi_lengths: Tensor | None = None, duration_phn: Tensor | None = None, duration_phn_lengths: Tensor | None = None, duration_ruled_phn: Tensor | None = None, duration_ruled_phn_lengths: Tensor | None = None, duration_syb: Tensor | None = None, duration_syb_lengths: Tensor | None = None, slur: Tensor | None = None, slur_lengths: Tensor | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, energy: Tensor | None = None, energy_lengths: Tensor | None = None, ying: Tensor | None = None, ying_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, **kwargs) → Dict[str, Tensor]
Calculate features and return them as a dict.
- Parameters:
- text (Tensor) – Text index tensor (B, T_text).
- text_lengths (Tensor) – Text length tensor (B,).
- singing (Tensor) – Singing waveform tensor (B, T_wav).
- singing_lengths (Tensor) – Singing length tensor (B,).
- label (Optional[Tensor]) – Label tensor (B, T_label).
- label_lengths (Optional[Tensor]) – Label length tensor (B,).
- phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb).
- midi (Optional[Tensor]) – Midi tensor (B, T_label).
- midi_lengths (Optional[Tensor]) – Midi length tensor (B,).
- Note: duration_* is duration in time_shift units.
- duration_phn (Optional[Tensor]) – Duration tensor (B, T_label).
- duration_phn_lengths (Optional[Tensor]) – Duration length tensor (B,).
- duration_ruled_phn (Optional[Tensor]) – Duration tensor (B, T_phone).
- duration_ruled_phn_lengths (Optional[Tensor]) – Duration length tensor (B,).
- duration_syb (Optional[Tensor]) – Duration tensor (B, T_syb).
- duration_syb_lengths (Optional[Tensor]) – Duration length tensor (B,).
- slur (Optional[Tensor]) – Slur tensor (B, T_slur).
- slur_lengths (Optional[Tensor]) – Slur length tensor (B,).
- pitch (Optional[Tensor]) – Pitch (f0) sequence tensor (B, T_wav).
- pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).
- energy (Optional[Tensor]) – Energy tensor.
- energy_lengths (Optional[Tensor]) – Energy length tensor (B,).
- ying (Optional[Tensor]) – Ying tensor.
- ying_lengths (Optional[Tensor]) – Ying length tensor (B,).
- spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).
- sids (Optional[Tensor]) – Speaker ID tensor (B, 1).
- lids (Optional[Tensor]) – Language ID tensor (B, 1).
- Returns: Dict of features.
- Return type: Dict[str, Tensor]
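The batched input shapes listed above can be illustrated with dummy tensors. This is a minimal sketch: the tensor sizes and value ranges are placeholders, and the `collect_feats` call itself is shown only in a comment, since constructing a real `ESPnetSVSModel` requires configured feature extractors.

```python
import torch

# Hypothetical batch of 2 utterances; shapes follow the docstring above.
B, T_text, T_wav = 2, 10, 16000
text = torch.randint(0, 40, (B, T_text))        # (B, T_text) token indices
text_lengths = torch.tensor([10, 8])            # (B,)
singing = torch.randn(B, T_wav)                 # (B, T_wav) raw waveform
singing_lengths = torch.tensor([16000, 12000])  # (B,)

# feats_dict = model.collect_feats(text, text_lengths, singing, singing_lengths)
# would return a dict of extracted features; which keys appear depends on
# the extractors passed to the constructor.
```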
forward(text: Tensor, text_lengths: Tensor, singing: Tensor, singing_lengths: Tensor, feats: Tensor | None = None, feats_lengths: Tensor | None = None, label: Tensor | None = None, label_lengths: Tensor | None = None, phn_cnt: Tensor | None = None, midi: Tensor | None = None, midi_lengths: Tensor | None = None, duration_phn: Tensor | None = None, duration_phn_lengths: Tensor | None = None, duration_ruled_phn: Tensor | None = None, duration_ruled_phn_lengths: Tensor | None = None, duration_syb: Tensor | None = None, duration_syb_lengths: Tensor | None = None, slur: Tensor | None = None, slur_lengths: Tensor | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, energy: Tensor | None = None, energy_lengths: Tensor | None = None, ying: Tensor | None = None, ying_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, flag_IsValid=False, **kwargs) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate outputs and return the loss tensor.
This method computes the forward pass of the ESPnetSVSModel, which includes feature extraction, normalization, and the final loss calculation for the singing voice synthesis task. It processes various input tensors and returns the loss tensor, statistics for monitoring, and a weight tensor.
- Parameters:
- text (Tensor) – Text index tensor (B, T_text).
- text_lengths (Tensor) – Text length tensor (B,).
- singing (Tensor) – Singing waveform tensor (B, T_wav).
- singing_lengths (Tensor) – Singing length tensor (B,).
- feats (Optional[Tensor]) – Features tensor (B, T_feats).
- feats_lengths (Optional[Tensor]) – Lengths of features tensor (B,).
- label (Optional[Tensor]) – Label tensor (B, T_label).
- label_lengths (Optional[Tensor]) – Label length tensor (B,).
- phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb).
- midi (Optional[Tensor]) – Midi tensor (B, T_label).
- midi_lengths (Optional[Tensor]) – Midi length tensor (B,).
- duration_phn (Optional[Tensor]) – Duration tensor (B, T_label).
- duration_phn_lengths (Optional[Tensor]) – Duration length tensor (B,).
- duration_ruled_phn (Optional[Tensor]) – Duration tensor (B, T_phone).
- duration_ruled_phn_lengths (Optional[Tensor]) – Duration length tensor (B,).
- duration_syb (Optional[Tensor]) – Duration tensor (B, T_syb).
- duration_syb_lengths (Optional[Tensor]) – Duration length tensor (B,).
- slur (Optional[Tensor]) – Slur tensor (B, T_slur).
- slur_lengths (Optional[Tensor]) – Slur length tensor (B,).
- pitch (Optional[Tensor]) – Pitch tensor (B, T_wav).
- pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).
- energy (Optional[Tensor]) – Energy tensor.
- energy_lengths (Optional[Tensor]) – Energy length tensor (B,).
- ying (Optional[Tensor]) – Ying tensor.
- ying_lengths (Optional[Tensor]) – Ying length tensor (B,).
- spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).
- sids (Optional[Tensor]) – Speaker ID tensor (B, 1).
- lids (Optional[Tensor]) – Language ID tensor (B, 1).
- flag_IsValid (bool, optional) – Flag indicating whether this is a validation pass.
- kwargs – Additional arguments, with "utt_id" among the inputs.
- Returns: Tuple of the loss scalar tensor, a dict of statistics to be monitored, and a weight tensor to summarize losses.
- Return type: Tuple[Tensor, Dict[str, Tensor], Tensor]
####### Examples
>>> loss, stats, weights = model.forward(
...     text=text_tensor,
...     text_lengths=text_lengths_tensor,
...     singing=singing_tensor,
...     singing_lengths=singing_lengths_tensor,
...     label=label_tensor,
...     label_lengths=label_lengths_tensor,
...     ...
... )
inference(text: Tensor, singing: Tensor | None = None, label: Tensor | None = None, phn_cnt: Tensor | None = None, midi: Tensor | None = None, duration_phn: Tensor | None = None, duration_ruled_phn: Tensor | None = None, duration_syb: Tensor | None = None, slur: Tensor | None = None, pitch: Tensor | None = None, energy: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, **decode_config) → Dict[str, Tensor]
Calculate outputs and return them as a dict.
- Parameters:
- text (Tensor) – Text index tensor (T_text).
- singing (Tensor) – Singing waveform tensor (T_wav).
- label (Optional[Tensor]) – Label tensor (T_label).
- phn_cnt (Optional[Tensor]) – Number of phones in each syllable (T_syb).
- midi (Optional[Tensor]) – Midi tensor (T_label).
- duration_phn (Optional[Tensor]) – Duration tensor (T_label).
- duration_ruled_phn (Optional[Tensor]) – Duration tensor (T_phone).
- duration_syb (Optional[Tensor]) – Duration tensor (T_syb).
- slur (Optional[Tensor]) – Slur tensor (T_slur).
- spembs (Optional[Tensor]) – Speaker embedding tensor (D,).
- sids (Optional[Tensor]) – Speaker ID tensor (1,).
- lids (Optional[Tensor]) – Language ID tensor (1,).
- pitch (Optional[Tensor]) – Pitch tensor (T_wav).
- energy (Optional[Tensor]) – Energy tensor.
- Returns: Dict of outputs.
- Return type: Dict[str, Tensor]
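Unlike `forward` and `collect_feats`, `inference` operates on a single utterance with no batch dimension. A minimal sketch of preparing such inputs follows; the sizes and vocabulary/MIDI ranges are placeholders, and the `inference` call is shown only in a comment since it requires a fully constructed model.

```python
import torch

# Single utterance: 1-D tensors, no batch dimension.
T_text, T_label = 12, 20
text = torch.randint(0, 40, (T_text,))     # (T_text,) token indices
label = torch.randint(0, 40, (T_label,))   # (T_label,) phoneme labels
midi = torch.randint(0, 128, (T_label,))   # (T_label,) MIDI note ids

# output = model.inference(text, label=label, midi=midi)
# "output" is a dict of generated outputs; typically it includes the
# synthesized feature sequence, which a separate vocoder converts to audio.
```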