espnet2.gan_svs.espnet_model.ESPnetGANSVSModel
class espnet2.gan_svs.espnet_model.ESPnetGANSVSModel(postfrontend: AbsFrontend | None, text_extract: AbsFeatsExtract | None, feats_extract: AbsFeatsExtract | None, score_feats_extract: AbsFeatsExtract | None, label_extract: AbsFeatsExtract | None, pitch_extract: AbsFeatsExtract | None, ying_extract: AbsFeatsExtract | None, duration_extract: AbsFeatsExtract | None, energy_extract: AbsFeatsExtract | None, normalize: InversibleInterface | None, pitch_normalize: InversibleInterface | None, energy_normalize: InversibleInterface | None, svs: AbsGANSVS)
Bases: AbsGANESPnetModel
ESPnet model for GAN-based singing voice synthesis task.
This model utilizes a Generative Adversarial Network (GAN) architecture for the task of singing voice synthesis (SVS). It processes input text, singing audio, and various feature tensors to generate synthetic singing voices. The model comprises components for feature extraction, normalization, and GAN-based synthesis.
text_extract
Feature extractor for text.
- Type: Optional[AbsFeatsExtract]
feats_extract
Feature extractor for singing.
- Type: Optional[AbsFeatsExtract]
score_feats_extract
Feature extractor for score-related features.
- Type: Optional[AbsFeatsExtract]
label_extract
Feature extractor for labels.
- Type: Optional[AbsFeatsExtract]
pitch_extract
Feature extractor for pitch.
- Type: Optional[AbsFeatsExtract]
duration_extract
Feature extractor for duration.
- Type: Optional[AbsFeatsExtract]
energy_extract
Feature extractor for energy.
- Type: Optional[AbsFeatsExtract]
ying_extract
Feature extractor for ying (YIN-based pitch representation).
- Type: Optional[AbsFeatsExtract]
normalize
Normalization layer for features.
- Type: Optional[AbsNormalize and InversibleInterface]
pitch_normalize
Normalization layer for pitch features.
- Type: Optional[AbsNormalize and InversibleInterface]
energy_normalize
Normalization layer for energy features.
- Type: Optional[AbsNormalize and InversibleInterface]
svs
The main GAN-based SVS component.
- Type: AbsGANSVS
postfrontend
Post-processing frontend for feature extraction.
- Type: Optional[AbsFrontend]
Parameters:
- postfrontend (Optional[AbsFrontend]) – Post-processing frontend.
- text_extract (Optional[AbsFeatsExtract]) – Text feature extractor.
- feats_extract (Optional[AbsFeatsExtract]) – Singing feature extractor.
- score_feats_extract (Optional[AbsFeatsExtract]) – Score feature extractor.
- label_extract (Optional[AbsFeatsExtract]) – Label feature extractor.
- pitch_extract (Optional[AbsFeatsExtract]) – Pitch feature extractor.
- ying_extract (Optional[AbsFeatsExtract]) – Ying feature extractor.
- duration_extract (Optional[AbsFeatsExtract]) – Duration feature extractor.
- energy_extract (Optional[AbsFeatsExtract]) – Energy feature extractor.
- normalize (Optional[AbsNormalize and InversibleInterface]) – Feature normalization.
- pitch_normalize (Optional[AbsNormalize and InversibleInterface]) – Pitch normalization.
- energy_normalize (Optional[AbsNormalize and InversibleInterface]) – Energy normalization.
- svs (AbsGANSVS) – GAN-based SVS component.
Raises: AssertionError – If the svs does not have 'generator' or 'discriminator' attributes.
########### Examples
Example of creating an ESPnetGANSVSModel instance:

>>> model = ESPnetGANSVSModel(
...     postfrontend=None,
...     text_extract=my_text_extract,
...     feats_extract=my_feats_extract,
...     score_feats_extract=my_score_feats_extract,
...     label_extract=my_label_extract,
...     pitch_extract=my_pitch_extract,
...     ying_extract=my_ying_extract,
...     duration_extract=my_duration_extract,
...     energy_extract=my_energy_extract,
...     normalize=my_normalize,
...     pitch_normalize=my_pitch_normalize,
...     energy_normalize=my_energy_normalize,
...     svs=my_svs,
... )
####### NOTE Ensure that the svs object has generator and discriminator attributes to function correctly with this model.
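As a hypothetical illustration of that requirement, the constructor's check can be mimicked with plain Python (FakeSVS and check_svs below are illustrative stand-ins, not ESPnet APIs):

```python
# Hypothetical sketch (not ESPnet code): the constructor asserts that the
# svs module exposes `generator` and `discriminator` submodules.
class FakeSVS:
    """Stand-in for an AbsGANSVS implementation."""

    def __init__(self, with_discriminator: bool = True):
        self.generator = object()
        if with_discriminator:
            self.discriminator = object()


def check_svs(svs) -> None:
    # Mirrors the documented AssertionError condition.
    assert hasattr(svs, "generator"), "svs must have a `generator` attribute"
    assert hasattr(svs, "discriminator"), "svs must have a `discriminator` attribute"


check_svs(FakeSVS())  # a complete GAN module passes the check
```

A module missing either attribute would fail this check with an AssertionError, matching the Raises entry above.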
Initialize ESPnetGANSVSModel module.
collect_feats(text: Tensor, text_lengths: Tensor, singing: Tensor, singing_lengths: Tensor, label: Tensor | None = None, label_lengths: Tensor | None = None, phn_cnt: Tensor | None = None, midi: Tensor | None = None, midi_lengths: Tensor | None = None, duration_phn: Tensor | None = None, duration_phn_lengths: Tensor | None = None, duration_ruled_phn: Tensor | None = None, duration_ruled_phn_lengths: Tensor | None = None, duration_syb: Tensor | None = None, duration_syb_lengths: Tensor | None = None, slur: Tensor | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, energy: Tensor | None = None, energy_lengths: Tensor | None = None, ying: Tensor | None = None, ying_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, **kwargs) → Dict[str, Tensor]
Calculate features and return them as a dict.
This function extracts various features from the provided singing waveform and related tensors. It computes and returns a dictionary containing the extracted features, such as singing features, pitch, energy, and additional features if available.
- Parameters:
- text (Tensor) – Text index tensor (B, T_text).
- text_lengths (Tensor) – Text length tensor (B,).
- singing (Tensor) – Singing waveform tensor (B, T_wav).
- singing_lengths (Tensor) – Singing length tensor (B,).
- label (Optional[Tensor]) – Label tensor (B, T_label).
- label_lengths (Optional[Tensor]) – Label length tensor (B,).
- phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb).
- midi (Optional[Tensor]) – Midi tensor (B, T_label).
- midi_lengths (Optional[Tensor]) – Midi length tensor (B,).
- duration_phn (Optional[Tensor]) – Duration tensor (B, T_label).
- duration_phn_lengths (Optional[Tensor]) – Duration length tensor (B,).
- duration_ruled_phn (Optional[Tensor]) – Duration tensor (B, T_phone).
- duration_ruled_phn_lengths (Optional[Tensor]) – Duration length tensor (B,).
- duration_syb (Optional[Tensor]) – Duration tensor (B, T_syllable).
- duration_syb_lengths (Optional[Tensor]) – Duration length tensor (B,).
- slur (Optional[Tensor]) – Slur tensor (B, T_slur).
- pitch (Optional[Tensor]) – Pitch tensor (B, T_wav), i.e., the f0 sequence.
- pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).
- energy (Optional[Tensor]) – Energy tensor.
- energy_lengths (Optional[Tensor]) – Energy length tensor (B,).
- ying (Optional[Tensor]) – Ying tensor.
- ying_lengths (Optional[Tensor]) – Ying length tensor (B,).
- spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).
- sids (Optional[Tensor]) – Speaker ID tensor (B, 1).
- lids (Optional[Tensor]) – Language ID tensor (B, 1).
- Returns: A dictionary containing the extracted features such as 'feats', 'feats_lengths', 'pitch', 'pitch_lengths', 'energy', 'energy_lengths', and, if computed, 'ying' and 'ying_lengths'.
- Return type: Dict[str, Tensor]
########### Examples
>>> model = ESPnetGANSVSModel(...)
>>> feats_dict = model.collect_feats(
... text=torch.tensor([[1, 2, 3]]),
... text_lengths=torch.tensor([3]),
... singing=torch.randn(1, 16000),
... singing_lengths=torch.tensor([16000]),
... pitch=torch.randn(1, 16000),
... energy=torch.randn(1, 16000)
... )
>>> print(feats_dict.keys())
dict_keys(['feats', 'feats_lengths', 'pitch', 'pitch_lengths',
'energy', 'energy_lengths'])
forward(text: Tensor, text_lengths: Tensor, singing: Tensor, singing_lengths: Tensor, feats: Tensor | None = None, feats_lengths: Tensor | None = None, label: Tensor | None = None, label_lengths: Tensor | None = None, phn_cnt: Tensor | None = None, midi: Tensor | None = None, midi_lengths: Tensor | None = None, duration_phn: Tensor | None = None, duration_phn_lengths: Tensor | None = None, duration_ruled_phn: Tensor | None = None, duration_ruled_phn_lengths: Tensor | None = None, duration_syb: Tensor | None = None, duration_syb_lengths: Tensor | None = None, slur: Tensor | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, energy: Tensor | None = None, energy_lengths: Tensor | None = None, ying: Tensor | None = None, ying_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, forward_generator: bool = True, **kwargs) → Dict[str, Any]
Return generator or discriminator loss with dict format.
This method processes the input tensors, extracts necessary features, and computes the loss for either the generator or discriminator based on the provided inputs.
- Parameters:
- text (Tensor) – Text index tensor (B, T_text).
- text_lengths (Tensor) – Text length tensor (B,).
- singing (Tensor) – Singing waveform tensor (B, T_wav).
- singing_lengths (Tensor) – Singing length tensor (B,).
- feats (Optional[Tensor]) – Feature tensor (B, T_feats).
- feats_lengths (Optional[Tensor]) – Feature lengths tensor (B,).
- label (Optional[Tensor]) – Label tensor (B, T_label).
- label_lengths (Optional[Tensor]) – Label length tensor (B,).
- phn_cnt (Optional[Tensor]) – Number of phones in each syllable (B, T_syb).
- midi (Optional[Tensor]) – Midi tensor (B, T_label).
- midi_lengths (Optional[Tensor]) – Midi length tensor (B,).
- duration_phn (Optional[Tensor]) – Duration tensor (B, T_label).
- duration_phn_lengths (Optional[Tensor]) – Duration length tensor (B,).
- duration_ruled_phn (Optional[Tensor]) – Duration tensor (B, T_phone).
- duration_ruled_phn_lengths (Optional[Tensor]) – Duration length tensor (B,).
- duration_syb (Optional[Tensor]) – Duration tensor (B, T_syllable).
- duration_syb_lengths (Optional[Tensor]) – Duration length tensor (B,).
- slur (Optional[Tensor]) – Slur tensor (B, T_slur).
- pitch (Optional[Tensor]) – Pitch tensor (B, T_wav), i.e., the f0 sequence.
- pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).
- energy (Optional[Tensor]) – Energy tensor.
- energy_lengths (Optional[Tensor]) – Energy length tensor (B,).
- ying (Optional[Tensor]) – Ying tensor.
- ying_lengths (Optional[Tensor]) – Ying length tensor (B,).
- spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).
- sids (Optional[Tensor]) – Speaker ID tensor (B, 1).
- lids (Optional[Tensor]) – Language ID tensor (B, 1).
- forward_generator (bool) – Whether to forward the generator (True) or the discriminator (False).
- kwargs – Additional arguments; "utt_id" is expected to be included.
- Returns:
- loss (Tensor): Loss scalar tensor.
- stats (Dict[str, float]): Statistics to be monitored.
- weight (Tensor): Weight tensor to summarize losses.
- optim_idx (int): Optimizer index (0 for G and 1 for D).
- Return type: Dict[str, Any]
########### Examples
>>> model = ESPnetGANSVSModel(...)
>>> loss_info = model.forward(
... text=text_tensor,
... text_lengths=text_lengths_tensor,
... singing=singing_tensor,
... singing_lengths=singing_lengths_tensor,
... )
>>> print(loss_info['loss'])
####### NOTE The method requires that the svs object has both generator and discriminator modules registered.
inference(text: Tensor, singing: Tensor | None = None, label: Tensor | None = None, phn_cnt: Tensor | None = None, midi: Tensor | None = None, duration_phn: Tensor | None = None, duration_ruled_phn: Tensor | None = None, duration_syb: Tensor | None = None, slur: Tensor | None = None, pitch: Tensor | None = None, energy: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, **decode_config) → Dict[str, Tensor]
Calculate features and return them as a dict.
This method performs inference by calculating various features based on the input text and optional singing waveform, returning a dictionary containing the generated outputs.
- Parameters:
- text (Tensor) – Text index tensor (T_text).
- singing (Tensor, optional) – Singing waveform tensor (T_wav).
- label (Tensor, optional) – Label tensor (T_label).
- phn_cnt (Tensor, optional) – Number of phones in each syllable (T_syb).
- midi (Tensor, optional) – Midi tensor (T_label).
- duration_phn (Tensor, optional) – Duration tensor (T_label).
- duration_ruled_phn (Tensor, optional) – Duration tensor (T_phone).
- duration_syb (Tensor, optional) – Duration tensor (T_syllable).
- slur (Tensor, optional) – Slur tensor (T_slur).
- pitch (Tensor, optional) – Pitch tensor (T_wav).
- energy (Tensor, optional) – Energy tensor.
- spembs (Tensor, optional) – Speaker embedding tensor (D,).
- sids (Tensor, optional) – Speaker ID tensor (1,).
- lids (Tensor, optional) – Language ID tensor (1,).
- **decode_config – Additional decoding configurations.
- Returns: A dictionary containing the generated outputs, which may include features like “feat_gen” and others based on the inference process.
- Return type: Dict[str, Tensor]
- Raises: RuntimeError – If 'singing' is required but not provided when using teacher forcing.
########### Examples
>>> text_tensor = torch.tensor([1, 2, 3, 4])
>>> singing_tensor = torch.tensor([0.1, 0.2, 0.3])
>>> output = model.inference(text_tensor, singing=singing_tensor)
>>> print(output.keys())
dict_keys(['feat_gen', ...]) # Output will depend on the model
####### NOTE The input tensors must be properly shaped as per the expected dimensions for the model to function correctly. Ensure that the appropriate decode configurations are provided as needed.