espnet2.gan_svs.joint.joint_score2wav.JointScore2Wav

class espnet2.gan_svs.joint.joint_score2wav.JointScore2Wav(idim: int, odim: int, segment_size: int = 32, sampling_rate: int = 22050, score2mel_type: str = 'xiaoice', score2mel_params: Dict[str, Any] = {'adim': 384, 'aheads': 4, 'conformer_activation_type': 'swish', 'conformer_dec_kernel_size': 31, 'conformer_enc_kernel_size': 7, 'conformer_pos_enc_layer_type': 'rel_pos', 'conformer_rel_pos_type': 'latest', 'conformer_self_attn_layer_type': 'rel_selfattn', 'decoder_concat_after': False, 'decoder_normalize_before': True, 'decoder_type': 'transformer', 'dlayers': 6, 'dunits': 1536, 'duration_predictor_chans': 384, 'duration_predictor_dropout_rate': 0.1, 'duration_predictor_kernel_size': 3, 'duration_predictor_layers': 2, 'elayers': 6, 'encoder_concat_after': False, 'encoder_normalize_before': True, 'encoder_type': 'transformer', 'eunits': 1536, 'init_dec_alpha': 1.0, 'init_enc_alpha': 1.0, 'init_type': 'xavier_uniform', 'lambda_dur': 0.1, 'lambda_mel': 1, 'lambda_pitch': 0.01, 'lambda_vuv': 0.01, 'langs': None, 'loss_function': 'XiaoiceSing2', 'loss_type': 'L1', 'midi_dim': 129, 'positionwise_conv_kernel_size': 1, 'positionwise_layer_type': 'conv1d', 'postnet_chans': 512, 'postnet_dropout_rate': 0.5, 'postnet_filts': 5, 'postnet_layers': 5, 'reduction_factor': 1, 'spk_embed_dim': None, 'spk_embed_integration_type': 'add', 'spks': None, 'tempo_dim': 500, 'transformer_dec_attn_dropout_rate': 0.1, 'transformer_dec_dropout_rate': 0.1, 'transformer_dec_positional_dropout_rate': 0.1, 'transformer_enc_attn_dropout_rate': 0.1, 'transformer_enc_dropout_rate': 0.1, 'transformer_enc_positional_dropout_rate': 0.1, 'use_batch_norm': True, 'use_cnn_in_conformer': True, 'use_macaron_style_in_conformer': True, 'use_masking': False, 'use_scaled_pos_enc': True, 'use_weighted_masking': False, 'zero_triu': False}, vocoder_type: str = 'hifigan_generator', vocoder_params: Dict[str, Any] = {'bias': True, 'channels': 512, 'global_channels': -1, 'kernel_size': 7, 'nonlinear_activation': 
'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'resblock_dilations': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'resblock_kernel_sizes': [3, 7, 11], 'upsample_kernel_sizes': [16, 16, 4, 4], 'upsample_scales': [8, 8, 2, 2], 'use_additional_convs': True, 'use_weight_norm': True}, use_pqmf: bool = False, pqmf_params: Dict[str, Any] = {'beta': 9.0, 'cutoff_ratio': 0.142, 'subbands': 4, 'taps': 62}, discriminator_type: str = 'hifigan_multi_scale_multi_period_discriminator', discriminator_params: Dict[str, Any] = {'follow_official_norm': False, 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'scale_discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scales': 1}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, use_feat_match_loss: bool = True, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, use_mel_loss: bool = True, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 22050, 'hop_length': 256, 'log_base': None, 'n_fft': 1024, 'n_mels': 80, 
'win_length': None, 'window': 'hann'}, lambda_score2mel: float = 1.0, lambda_adv: float = 1.0, lambda_feat_match: float = 2.0, lambda_mel: float = 45.0, cache_generator_outputs: bool = False)

Bases: AbsGANSVS

General class to jointly train score2mel and vocoder parts.

This class is designed for end-to-end training of a score-to-mel model together with a vocoder, so that singing voice waveforms can be generated from score inputs such as text, label, melody, and duration. The score-to-mel model and the vocoder are each selected and configured through the constructor arguments.

segment_size

Segment size for random windowed inputs.

Type: int

use_pqmf

Whether to use PQMF for multi-band vocoder.

Type: bool

generator

Dictionary containing the score2mel and vocoder models.

Type: torch.nn.ModuleDict

discriminator

Discriminator model for adversarial training.

Type: object

generator_adv_loss

Loss function for the generator’s adversarial training.

Type: object

discriminator_adv_loss

Loss function for the discriminator’s adversarial training.

Type: object

use_feat_match_loss

Whether to use feature matching loss.

Type: bool

use_mel_loss

Whether to use mel spectrogram loss.

Type: bool

Sampling rate for saving waveform during inference.

Type: int

_cache

Cached outputs for generator to reuse.

Type: Optional[tuple]
Parameters:
- idim (int) – Input vocabulary size.
- odim (int) – Acoustic feature dimension.
- segment_size (int, optional) – Segment size for random windowed inputs. Defaults to 32.
- sampling_rate (int, optional) – Sampling rate for saving waveforms. Defaults to 22050.
- score2mel_type (str, optional) – Type of score2mel model. Defaults to “xiaoice”.
- score2mel_params (Dict[str, Any], optional) – Parameters for the score2mel model. Defaults to a predefined set.
- vocoder_type (str, optional) – Type of vocoder model. Defaults to “hifigan_generator”.
- vocoder_params (Dict[str, Any], optional) – Parameters for the vocoder model. Defaults to a predefined set.
- use_pqmf (bool, optional) – Whether to use PQMF for multi-band vocoder. Defaults to False.
- pqmf_params (Dict[str, Any], optional) – Parameters for PQMF module. Defaults to a predefined set.
- discriminator_type (str, optional) – Type of discriminator model. Defaults to “hifigan_multi_scale_multi_period_discriminator”.
- discriminator_params (Dict[str, Any], optional) – Parameters for the discriminator. Defaults to a predefined set.
- generator_adv_loss_params (Dict[str, Any], optional) – Parameters for generator adversarial loss. Defaults to a predefined set.
- discriminator_adv_loss_params (Dict[str, Any], optional) – Parameters for discriminator adversarial loss. Defaults to a predefined set.
- use_feat_match_loss (bool, optional) – Whether to use feature match loss. Defaults to True.
- feat_match_loss_params (Dict[str, Any], optional) – Parameters for feature match loss. Defaults to a predefined set.
- use_mel_loss (bool, optional) – Whether to use mel loss. Defaults to True.
- mel_loss_params (Dict[str, Any], optional) – Parameters for mel loss. Defaults to a predefined set.
- lambda_score2mel (float, optional) – Loss scaling coefficient for score2mel model loss. Defaults to 1.0.
- lambda_adv (float, optional) – Loss scaling coefficient for adversarial loss. Defaults to 1.0.
- lambda_feat_match (float, optional) – Loss scaling coefficient for feature match loss. Defaults to 2.0.
- lambda_mel (float, optional) – Loss scaling coefficient for mel loss. Defaults to 45.0.
- cache_generator_outputs (bool, optional) – Whether to cache generator outputs. Defaults to False.
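
The four loss scaling coefficients above suggest that the generator objective is a weighted sum of its loss terms. The following is a minimal sketch of that idea, assuming a plain weighted sum in which the feature match and mel terms are skipped when their flags are off. The function name combine_generator_loss and the numeric inputs are hypothetical, and the library's actual reduction may differ.

```python
def combine_generator_loss(
    score2mel_loss: float,
    adv_loss: float,
    feat_match_loss: float,
    mel_loss: float,
    lambda_score2mel: float = 1.0,
    lambda_adv: float = 1.0,
    lambda_feat_match: float = 2.0,
    lambda_mel: float = 45.0,
    use_feat_match_loss: bool = True,
    use_mel_loss: bool = True,
) -> float:
    """Weighted sum of generator loss terms (illustrative, not the library code)."""
    total = lambda_score2mel * score2mel_loss + lambda_adv * adv_loss
    if use_feat_match_loss:
        total += lambda_feat_match * feat_match_loss
    if use_mel_loss:
        total += lambda_mel * mel_loss
    return total


# Made-up loss values, combined with the documented default coefficients.
total = combine_generator_loss(
    score2mel_loss=0.5, adv_loss=2.0, feat_match_loss=1.0, mel_loss=0.25
)
# Same values with the mel term disabled.
total_without_mel = combine_generator_loss(
    score2mel_loss=0.5, adv_loss=2.0, feat_match_loss=1.0, mel_loss=0.25,
    use_mel_loss=False,
)
```

With the default lambda_mel of 45.0, the mel term dominates the sum unless the raw mel loss is small, which is worth keeping in mind when changing any of the coefficients.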

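Examples:

>>> import torch
>>> from espnet2.gan_svs.joint.joint_score2wav import JointScore2Wav
>>> # Create an instance of JointScore2Wav
>>> model = JointScore2Wav(idim=256, odim=80, segment_size=32)
>>> # Forward pass; score inputs (label, melody, duration, ...) are omitted
>>> # here, although the chosen score2mel model may require them
>>> output = model(
...     text=torch.randint(0, 100, (8, 50)), text_lengths=torch.tensor([50] * 8),
...     feats=torch.randn(8, 100, 80), feats_lengths=torch.tensor([100] * 8),
...     singing=torch.randn(8, 16000), singing_lengths=torch.tensor([16000] * 8),
... )
>>> # Run inference on a single unbatched input
>>> wav_output = model.inference(
...     text=torch.randint(0, 100, (50,)), feats=torch.randn(100, 80)
... )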

Initialize JointScore2Wav module.

Parameters:
- idim (int) – Input vocabulary size.
- odim (int) – Acoustic feature dimension. The actual output has 1 channel because the model is an end-to-end score-to-wave model, but odim is kept for compatibility to indicate the acoustic feature dimension.
- segment_size (int) – Segment size for random windowed inputs.
- sampling_rate (int) – Sampling rate. Not used for training, but referred to when saving the waveform during inference.
- score2mel_type (str) – The score2mel model type.
- score2mel_params (Dict[str, Any]) – Parameter dict for score2mel model.
- use_pqmf (bool) – Whether to use PQMF for multi-band vocoder.
- pqmf_params (Dict[str, Any]) – Parameter dict for PQMF module.
- vocoder_type (str) – The vocoder model type.
- vocoder_params (Dict[str, Any]) – Parameter dict for vocoder model.
- discriminator_type (str) – Discriminator type.
- discriminator_params (Dict[str, Any]) – Parameter dict for discriminator.
- generator_adv_loss_params (Dict[str, Any]) – Parameter dict for generator adversarial loss.
- discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for discriminator adversarial loss.
- use_feat_match_loss (bool) – Whether to use feat match loss.
- feat_match_loss_params (Dict[str, Any]) – Parameter dict for feat match loss.
- use_mel_loss (bool) – Whether to use mel loss.
- mel_loss_params (Dict[str, Any]) – Parameter dict for mel loss.
- lambda_score2mel (float) – Loss scaling coefficient for score2mel model loss.
- lambda_adv (float) – Loss scaling coefficient for adversarial loss.
- lambda_feat_match (float) – Loss scaling coefficient for feat match loss.
- lambda_mel (float) – Loss scaling coefficient for mel loss.
- cache_generator_outputs (bool) – Whether to cache generator outputs.
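
Because segment_size is given for the windowed inputs and the vocoder upsamples features to a waveform, a window of feature frames corresponds to a longer window of waveform samples. The sketch below works this out from the documented defaults, assuming the number of samples per frame equals the product of upsample_scales. The helper matching_wave_segment is hypothetical and only illustrates the arithmetic.

```python
import math
from typing import Tuple

upsample_scales = [8, 8, 2, 2]  # default vocoder_params["upsample_scales"]
segment_size = 32               # default segment_size, assumed to count feature frames
hop_length = 256                # default mel_loss_params["hop_length"]

# Assumption: each feature frame is upsampled by the product of the scales.
samples_per_frame = math.prod(upsample_scales)
segment_samples = segment_size * samples_per_frame


def matching_wave_segment(start_frame: int) -> Tuple[int, int]:
    """Sample range aligned with frames [start_frame, start_frame + segment_size)."""
    start = start_frame * samples_per_frame
    return start, start + segment_samples


wave_range = matching_wave_segment(10)
```

Under this assumption the default upsampling factor equals the default hop_length, so changing one without the other would misalign the generated waveform and the mel loss frames.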

forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, singing: Tensor, singing_lengths: Tensor, label: Dict[str, Tensor] | None = None, label_lengths: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: LongTensor | None = None, duration: Dict[str, Tensor] | None = None, slur: LongTensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, forward_generator: bool = True) → Dict[str, Any]

Perform generator or discriminator forward pass.

This method is responsible for executing either the generator or discriminator forward pass based on the provided forward_generator flag. It computes the loss and statistics required for training the model.

Parameters:
- text (LongTensor) – Batch of padded character ids (B, Tmax).
- text_lengths (LongTensor) – Batch of the lengths of each input sequence (B,).
- feats (Tensor) – Batch of padded target features (B, Lmax, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- singing (Tensor) – Singing waveform tensor (B, T_wav).
- singing_lengths (Tensor) – Singing length tensor (B,).
- label (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
- label_lengths (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B,).
- melody (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
- pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
- duration (Optional[Dict]) – Key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, Tmax).
- slur (FloatTensor) – Batch of padded slur (B, Tmax).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- forward_generator (bool) – Whether to forward generator.
Returns:
- loss (Tensor): Loss scalar tensor.
- stats (Dict[str, float]): Statistics to be monitored.
- weight (Tensor): Weight tensor to summarize losses.
- optim_idx (int): Optimizer index (0 for G and 1 for D).
Return type: Dict[str, Any]
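
The returned optim_idx tells the caller which optimizer belongs to the pass that was just run. The following is a minimal sketch of that dispatch. fake_forward is a hypothetical stand-in that only mimics the documented return keys with dummy values, and the optimizer names are placeholders rather than real optimizer objects.

```python
from typing import Any, Dict, List


def pick_optimizer(output: Dict[str, Any], optimizers: List[str]) -> str:
    """Select the optimizer that matches the pass that produced `output`."""
    return optimizers[output["optim_idx"]]


def fake_forward(forward_generator: bool) -> Dict[str, Any]:
    # Hypothetical stand-in for model.forward(...): same keys, dummy values.
    return {
        "loss": 1.0 if forward_generator else 0.5,
        "stats": {"loss": 1.0 if forward_generator else 0.5},
        "weight": 8,
        "optim_idx": 0 if forward_generator else 1,  # 0 for G, 1 for D
    }


optimizers = ["generator_optimizer", "discriminator_optimizer"]
# Alternate generator and discriminator turns, as in GAN training.
schedule = [pick_optimizer(fake_forward(turn), optimizers) for turn in (True, False)]
```

In a real training loop the selected entry would be an optimizer object that is stepped after backpropagating output["loss"].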

Examples:

>>> text = torch.randint(0, 100, (32, 50))  # Batch of text
>>> text_lengths = torch.randint(1, 50, (32,))
>>> feats = torch.rand(32, 100, 80)  # Target features
>>> feats_lengths = torch.randint(1, 100, (32,))
>>> singing = torch.rand(32, 16000)  # Singing waveform
>>> singing_lengths = torch.randint(1, 16000, (32,))
>>> output = model.forward(
...     text, text_lengths, feats, feats_lengths,
...     singing, singing_lengths, forward_generator=True
... )
>>> print(output.keys())  # Should print keys: loss, stats, weight, optim_idx

inference(text: Tensor, feats: Tensor | None = None, label: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, duration: Dict[str, Tensor] | None = None, slur: Dict[str, Tensor] | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, alpha: float = 1.0, max_len: int | None = None, use_teacher_forcing: bool = False) → Dict[str, Tensor]

Run inference and return the outputs as a Dict[str, Tensor].

The attributes and constructor parameters of the class are listed below for reference. The class implements joint training of a score-to-mel model and a vocoder, and each of the two can be configured separately for training and inference.

segment_size

Segment size for random windowed inputs.

Type: int

use_pqmf

Whether to use PQMF for multi-band vocoder.

Type: bool

generator

Dictionary containing the score2mel and vocoder models.

Type: torch.nn.ModuleDict

discriminator

Discriminator model.

Type: Discriminator

generator_adv_loss

Adversarial loss for the generator.

Type: GeneratorAdversarialLoss

discriminator_adv_loss

Adversarial loss for the discriminator.

Type: DiscriminatorAdversarialLoss

use_feat_match_loss

Flag indicating if feature match loss is used.

Type: bool

feat_match_loss

Feature match loss module.

Type: FeatureMatchLoss

use_mel_loss

Flag indicating if mel loss is used.

Type: bool

mel_loss

Mel spectrogram loss module.

Type: MelSpectrogramLoss

lambda_score2mel

Loss scaling coefficient for score2mel model loss.

Type: float

lambda_adv

Loss scaling coefficient for adversarial loss.

Type: float

lambda_feat_match

Loss scaling coefficient for feature match loss.

Type: float

lambda_mel

Loss scaling coefficient for mel loss.

Type: float

cache_generator_outputs

Whether to cache generator outputs.

Type: bool

Sampling rate for saving waveform during inference.

Type: int

_cache

Cached outputs for generator during training.

Type: Optional[Tuple]
Parameters:
- idim (int) – Input vocabulary size.
- odim (int) – Acoustic feature dimension.
- segment_size (int) – Segment size for random windowed inputs.
- sampling_rate (int) – Sampling rate for saving waveform.
- score2mel_type (str) – The score2mel model type.
- score2mel_params (Dict[str, Any]) – Parameter dict for score2mel model.
- vocoder_type (str) – The vocoder model type.
- vocoder_params (Dict[str, Any]) – Parameter dict for vocoder model.
- use_pqmf (bool) – Whether to use PQMF for multi-band vocoder.
- pqmf_params (Dict[str, Any]) – Parameter dict for PQMF module.
- discriminator_type (str) – Discriminator type.
- discriminator_params (Dict[str, Any]) – Parameter dict for discriminator.
- generator_adv_loss_params (Dict[str, Any]) – Parameter dict for generator adversarial loss.
- discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for discriminator adversarial loss.
- use_feat_match_loss (bool) – Whether to use feature match loss.
- feat_match_loss_params (Dict[str, Any]) – Parameter dict for feature match loss.
- use_mel_loss (bool) – Whether to use mel loss.
- mel_loss_params (Dict[str, Any]) – Parameter dict for mel loss.
- lambda_score2mel (float) – Loss scaling coefficient for score2mel model loss.
- lambda_adv (float) – Loss scaling coefficient for adversarial loss.
- lambda_feat_match (float) – Loss scaling coefficient for feature match loss.
- lambda_mel (float) – Loss scaling coefficient for mel loss.
- cache_generator_outputs (bool) – Whether to cache generator outputs.
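
The default adversarial loss parameters use loss_type 'mse'. As a rough illustration of what an MSE-style adversarial objective looks like, the sketch below follows the common least-squares GAN formulation on plain Python lists of discriminator scores. Whether the library reduces and averages in exactly this way is an assumption, and the function names are hypothetical.

```python
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def generator_mse_loss(fake_scores: Sequence[float]) -> float:
    """Push discriminator scores on generated audio toward 1."""
    return _mean([(s - 1.0) ** 2 for s in fake_scores])


def discriminator_mse_loss(
    real_scores: Sequence[float], fake_scores: Sequence[float]
) -> float:
    """Push scores on real audio toward 1 and on generated audio toward 0."""
    real_loss = _mean([(s - 1.0) ** 2 for s in real_scores])
    fake_loss = _mean([s ** 2 for s in fake_scores])
    return real_loss + fake_loss


# Made-up discriminator scores for two samples.
g_loss = generator_mse_loss([0.0, 0.5])
d_loss = discriminator_mse_loss(real_scores=[1.0, 0.5], fake_scores=[0.0, 0.5])
```

The two losses pull the same fake scores in opposite directions, which is why the generator and discriminator passes are run as separate turns with separate optimizers.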

Examples:

>>> model = JointScore2Wav(idim=256, odim=80)
>>> output = model.forward(text, text_lengths, feats, feats_lengths,
...                         singing, singing_lengths)
>>> inference_output = model.inference(text, feats)

property require_raw_singing

Return whether or not singing is required.

This property indicates whether the model requires raw singing data as input. For JointScore2Wav it is True: forward() takes the raw singing waveform (singing, singing_lengths) as a training target, while inference() takes no waveform input.

Returns: True if raw singing is required, False otherwise.
Return type: bool

Examples:

>>> model = JointScore2Wav(...)
>>> model.require_raw_singing
True

property require_vocoder

Return whether or not a separate vocoder is required.

The attributes and constructor parameters of the class are listed below for reference. The class integrates score-to-mel conversion and vocoding in a single model, designed for end-to-end training and inference of singing voice synthesis systems.

segment_size

Size of segments for random windowed inputs.

Type: int

use_pqmf

Flag indicating whether to use PQMF for multi-band vocoding.

Type: bool

generator

Dictionary containing the score-to-mel and vocoder generators.

Type: torch.nn.ModuleDict

discriminator

Discriminator for adversarial training.

Type: object

generator_adv_loss

Loss function for generator’s adversarial training.

Type: object

discriminator_adv_loss

Loss function for discriminator’s adversarial training.

Type: object

use_feat_match_loss

Flag indicating whether to use feature matching loss.

Type: bool

feat_match_loss

Loss function for feature matching.

Type: object

use_mel_loss

Flag indicating whether to use mel loss.

Type: bool

mel_loss

Loss function for mel spectrogram matching.

Type: object

lambda_score2mel

Coefficient for scaling score-to-mel loss.

Type: float

lambda_adv

Coefficient for scaling adversarial loss.

Type: float

lambda_feat_match

Coefficient for scaling feature match loss.

Type: float

lambda_mel

Coefficient for scaling mel loss.

Type: float

cache_generator_outputs

Flag indicating whether to cache generator outputs.

Type: bool

Sampling rate for saving waveforms during inference.

Type: int

spks

List of speaker IDs for compatibility.

Type: list

langs

List of language IDs for compatibility.

Type: list

spk_embed_dim

Dimension of speaker embeddings.

Type: int
Parameters:
- idim (int) – Input vocabulary size.
- odim (int) – Acoustic feature dimension. The actual output has 1 channel because the model is an end-to-end score-to-wave model, but odim is kept for compatibility to indicate the acoustic feature dimension.
- segment_size (int) – Segment size for random windowed inputs.
- sampling_rate (int) – Sampling rate. Not used for training, but referred to when saving the waveform during inference.
- score2mel_type (str) – The score2mel model type.
- score2mel_params (Dict[str, Any]) – Parameter dict for score2mel model.
- vocoder_type (str) – The vocoder model type.
- vocoder_params (Dict[str, Any]) – Parameter dict for vocoder model.
- use_pqmf (bool) – Whether to use PQMF for multi-band vocoder.
- pqmf_params (Dict[str, Any]) – Parameter dict for PQMF module.
- discriminator_type (str) – Discriminator type.
- discriminator_params (Dict[str, Any]) – Parameter dict for discriminator.
- generator_adv_loss_params (Dict[str, Any]) – Parameter dict for generator adversarial loss.
- discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for discriminator adversarial loss.
- use_feat_match_loss (bool) – Whether to use feature match loss.
- feat_match_loss_params (Dict[str, Any]) – Parameter dict for feature match loss.
- use_mel_loss (bool) – Whether to use mel loss.
- mel_loss_params (Dict[str, Any]) – Parameter dict for mel loss.
- lambda_score2mel (float) – Loss scaling coefficient for score2mel model loss.
- lambda_adv (float) – Loss scaling coefficient for adversarial loss.
- lambda_feat_match (float) – Loss scaling coefficient for feature match loss.
- lambda_mel (float) – Loss scaling coefficient for mel loss.
- cache_generator_outputs (bool) – Whether to cache generator outputs.
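
The cache_generator_outputs flag implies that generator outputs computed in one pass can be reused in the next instead of being recomputed. The sketch below shows one way such a cache could behave, assuming the outputs are stored on the generator turn, reused once on the following discriminator turn, and then cleared. The class GeneratorOutputCache is hypothetical and is not the library's implementation.

```python
from typing import Callable, Optional, Tuple


class GeneratorOutputCache:
    """Reuse generator outputs once, then reset (illustrative only)."""

    def __init__(self, enabled: bool) -> None:
        self.enabled = enabled
        self._cache: Optional[Tuple] = None
        self.generator_calls = 0  # counts how often the generator really ran

    def get(self, run_generator: Callable[[], Tuple]) -> Tuple:
        if self.enabled and self._cache is not None:
            # Reuse the stored outputs and clear the cache for the next step.
            outputs, self._cache = self._cache, None
            return outputs
        outputs = run_generator()
        self.generator_calls += 1
        if self.enabled:
            self._cache = outputs
        return outputs


cache = GeneratorOutputCache(enabled=True)
g_turn = cache.get(lambda: ("wav", "mel"))  # generator turn: computes and stores
d_turn = cache.get(lambda: ("wav", "mel"))  # discriminator turn: reuses
calls_after_one_step = cache.generator_calls
next_turn = cache.get(lambda: ("wav", "mel"))  # next step: computes again
```

Reusing the outputs saves one generator forward pass per training step, at the cost of the discriminator seeing outputs from before the latest generator update.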

Examples:

>>> model = JointScore2Wav(idim=100, odim=80)
>>> output = model.forward(text, text_lengths, feats, feats_lengths,
...                         singing, singing_lengths)

NOTE

The class requires PyTorch and ESPnet2, which provides the GAN-based singing voice synthesis (gan_svs) components. Ensure that both are installed.