espnet2.gan_svs.joint.joint_score2wav.JointScore2Wav
class espnet2.gan_svs.joint.joint_score2wav.JointScore2Wav(idim: int, odim: int, segment_size: int = 32, sampling_rate: int = 22050, score2mel_type: str = 'xiaoice', score2mel_params: Dict[str, Any] = {'adim': 384, 'aheads': 4, 'conformer_activation_type': 'swish', 'conformer_dec_kernel_size': 31, 'conformer_enc_kernel_size': 7, 'conformer_pos_enc_layer_type': 'rel_pos', 'conformer_rel_pos_type': 'latest', 'conformer_self_attn_layer_type': 'rel_selfattn', 'decoder_concat_after': False, 'decoder_normalize_before': True, 'decoder_type': 'transformer', 'dlayers': 6, 'dunits': 1536, 'duration_predictor_chans': 384, 'duration_predictor_dropout_rate': 0.1, 'duration_predictor_kernel_size': 3, 'duration_predictor_layers': 2, 'elayers': 6, 'encoder_concat_after': False, 'encoder_normalize_before': True, 'encoder_type': 'transformer', 'eunits': 1536, 'init_dec_alpha': 1.0, 'init_enc_alpha': 1.0, 'init_type': 'xavier_uniform', 'lambda_dur': 0.1, 'lambda_mel': 1, 'lambda_pitch': 0.01, 'lambda_vuv': 0.01, 'langs': None, 'loss_function': 'XiaoiceSing2', 'loss_type': 'L1', 'midi_dim': 129, 'positionwise_conv_kernel_size': 1, 'positionwise_layer_type': 'conv1d', 'postnet_chans': 512, 'postnet_dropout_rate': 0.5, 'postnet_filts': 5, 'postnet_layers': 5, 'reduction_factor': 1, 'spk_embed_dim': None, 'spk_embed_integration_type': 'add', 'spks': None, 'tempo_dim': 500, 'transformer_dec_attn_dropout_rate': 0.1, 'transformer_dec_dropout_rate': 0.1, 'transformer_dec_positional_dropout_rate': 0.1, 'transformer_enc_attn_dropout_rate': 0.1, 'transformer_enc_dropout_rate': 0.1, 'transformer_enc_positional_dropout_rate': 0.1, 'use_batch_norm': True, 'use_cnn_in_conformer': True, 'use_macaron_style_in_conformer': True, 'use_masking': False, 'use_scaled_pos_enc': True, 'use_weighted_masking': False, 'zero_triu': False}, vocoder_type: str = 'hifigan_generator', vocoder_params: Dict[str, Any] = {'bias': True, 'channels': 512, 'global_channels': -1, 'kernel_size': 7, 'nonlinear_activation': 
'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'resblock_dilations': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'resblock_kernel_sizes': [3, 7, 11], 'upsample_kernel_sizes': [16, 16, 4, 4], 'upsample_scales': [8, 8, 2, 2], 'use_additional_convs': True, 'use_weight_norm': True}, use_pqmf: bool = False, pqmf_params: Dict[str, Any] = {'beta': 9.0, 'cutoff_ratio': 0.142, 'subbands': 4, 'taps': 62}, discriminator_type: str = 'hifigan_multi_scale_multi_period_discriminator', discriminator_params: Dict[str, Any] = {'follow_official_norm': False, 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'scale_discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scales': 1}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, use_feat_match_loss: bool = True, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, use_mel_loss: bool = True, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 22050, 'hop_length': 256, 'log_base': None, 'n_fft': 1024, 'n_mels': 80, 
'win_length': None, 'window': 'hann'}, lambda_score2mel: float = 1.0, lambda_adv: float = 1.0, lambda_feat_match: float = 2.0, lambda_mel: float = 45.0, cache_generator_outputs: bool = False)
Bases: AbsGANSVS
General class to jointly train score2mel and vocoder parts.
This class is designed for end-to-end training of a score-to-mel (score2mel) model together with a vocoder, so that singing voice waveforms can be generated directly from music score inputs. The score2mel model, the vocoder, and the discriminator are each selected through the corresponding type and params arguments.
segment_size
Segment size for random windowed inputs.
- Type: int
use_pqmf
Whether to use PQMF for multi-band vocoder.
- Type: bool
generator
Dictionary containing the score2mel and vocoder models.
- Type: torch.nn.ModuleDict
discriminator
Discriminator model for adversarial training.
- Type: object
generator_adv_loss
Loss function for the generator’s adversarial training.
- Type: object
discriminator_adv_loss
Loss function for the discriminator’s adversarial training.
- Type: object
use_feat_match_loss
Whether to use feature matching loss.
- Type: bool
use_mel_loss
Whether to use mel spectrogram loss.
- Type: bool
fs
Sampling rate for saving waveform during inference.
- Type: int
_cache
Cached outputs for generator to reuse.
- Type: Optional[tuple]
Parameters:
- idim (int) – Input vocabulary size.
- odim (int) – Acoustic feature dimension.
- segment_size (int, optional) – Segment size for random windowed inputs. Defaults to 32.
- sampling_rate (int, optional) – Sampling rate for saving waveforms. Defaults to 22050.
- score2mel_type (str, optional) – Type of score2mel model. Defaults to “xiaoice”.
- score2mel_params (Dict[str, Any], optional) – Parameters for the score2mel model. Defaults to the dict shown in the signature.
- vocoder_type (str, optional) – Type of vocoder model. Defaults to “hifigan_generator”.
- vocoder_params (Dict[str, Any], optional) – Parameters for the vocoder model. Defaults to the dict shown in the signature.
- use_pqmf (bool, optional) – Whether to use PQMF for multi-band vocoder. Defaults to False.
- pqmf_params (Dict[str, Any], optional) – Parameters for PQMF module. Defaults to the dict shown in the signature.
- discriminator_type (str, optional) – Type of discriminator model. Defaults to “hifigan_multi_scale_multi_period_discriminator”.
- discriminator_params (Dict[str, Any], optional) – Parameters for the discriminator. Defaults to the dict shown in the signature.
- generator_adv_loss_params (Dict[str, Any], optional) – Parameters for generator adversarial loss. Defaults to the dict shown in the signature.
- discriminator_adv_loss_params (Dict[str, Any], optional) – Parameters for discriminator adversarial loss. Defaults to the dict shown in the signature.
- use_feat_match_loss (bool, optional) – Whether to use feature match loss. Defaults to True.
- feat_match_loss_params (Dict[str, Any], optional) – Parameters for feature match loss. Defaults to the dict shown in the signature.
- use_mel_loss (bool, optional) – Whether to use mel loss. Defaults to True.
- mel_loss_params (Dict[str, Any], optional) – Parameters for mel loss. Defaults to the dict shown in the signature.
- lambda_score2mel (float, optional) – Loss scaling coefficient for score2mel model loss. Defaults to 1.0.
- lambda_adv (float, optional) – Loss scaling coefficient for adversarial loss. Defaults to 1.0.
- lambda_feat_match (float, optional) – Loss scaling coefficient for feature match loss. Defaults to 2.0.
- lambda_mel (float, optional) – Loss scaling coefficient for mel loss. Defaults to 45.0.
- cache_generator_outputs (bool, optional) – Whether to cache generator outputs. Defaults to False.
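As a rough illustration of how segment_size relates to waveform length, the sketch below assumes that the random window is counted in mel frames and that the vocoder upsamples each frame by the product of upsample_scales. With the default values from the signature this gives 256 samples per frame, which matches the default hop_length in mel_loss_params. The helper name segment_in_samples is hypothetical and not part of ESPnet.

```python
import math

# Default values copied from the class signature.
segment_size = 32                 # frames per random training window
upsample_scales = [8, 8, 2, 2]    # vocoder_params["upsample_scales"]
hop_length = 256                  # mel_loss_params["hop_length"]
sampling_rate = 22050


def segment_in_samples(n_frames: int, scales: list) -> int:
    """Hypothetical helper: waveform samples produced for n_frames mel frames,
    assuming the vocoder upsamples by the product of its scales."""
    return n_frames * math.prod(scales)


upsample_factor = math.prod(upsample_scales)
n_samples = segment_in_samples(segment_size, upsample_scales)
duration_sec = n_samples / sampling_rate

print(upsample_factor)         # 256
print(n_samples)               # 8192
print(round(duration_sec, 3))  # 0.372
```

Under these assumptions, changing upsample_scales or hop_length independently would break the frame-to-sample alignment, so the two are usually adjusted together.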
Examples
>>> import torch
>>> from espnet2.gan_svs.joint.joint_score2wav import JointScore2Wav
>>> # Create an instance of JointScore2Wav
>>> model = JointScore2Wav(idim=256, odim=80, segment_size=32)
>>> # Forward pass through the model (generator turn by default)
>>> output = model(
...     text=torch.randint(0, 100, (8, 50)),
...     text_lengths=torch.tensor([50] * 8),
...     feats=torch.randn(8, 100, 80),
...     feats_lengths=torch.tensor([100] * 8),
...     singing=torch.randn(8, 16000),
...     singing_lengths=torch.tensor([16000] * 8),
... )
>>> # Run inference
>>> wav_output = model.inference(
...     text=torch.randint(0, 100, (50,)),
...     feats=torch.randn(100, 80),
... )
Initialize JointScore2Wav module.
- Parameters:
- idim (int) – Input vocabulary size.
- odim (int) – Acoustic feature dimension. The model is end-to-end (score to waveform), so the actual number of output channels is 1; odim is kept for compatibility and indicates the acoustic feature dimension.
- segment_size (int) – Segment size for random windowed inputs.
- sampling_rate (int) – Sampling rate. Not used during training; it is referred to when saving waveforms during inference.
- score2mel_type (str) – The score2mel model type.
- score2mel_params (Dict[str, Any]) – Parameter dict for score2mel model.
- use_pqmf (bool) – Whether to use PQMF for multi-band vocoder.
- pqmf_params (Dict[str, Any]) – Parameter dict for PQMF module.
- vocoder_type (str) – The vocoder model type.
- vocoder_params (Dict[str, Any]) – Parameter dict for vocoder model.
- discriminator_type (str) – Discriminator type.
- discriminator_params (Dict[str, Any]) – Parameter dict for discriminator.
- generator_adv_loss_params (Dict[str, Any]) – Parameter dict for generator adversarial loss.
- discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for discriminator adversarial loss.
- use_feat_match_loss (bool) – Whether to use feature match loss.
- feat_match_loss_params (Dict[str, Any]) – Parameter dict for feature match loss.
- use_mel_loss (bool) – Whether to use mel loss.
- mel_loss_params (Dict[str, Any]) – Parameter dict for mel loss.
- lambda_score2mel (float) – Loss scaling coefficient for score2mel model loss.
- lambda_adv (float) – Loss scaling coefficient for adversarial loss.
- lambda_feat_match (float) – Loss scaling coefficient for feature match loss.
- lambda_mel (float) – Loss scaling coefficient for mel loss.
- cache_generator_outputs (bool) – Whether to cache generator outputs.
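The four lambda coefficients are described as loss scaling coefficients. A minimal sketch of how they could combine into one generator loss is shown below, assuming a plain weighted sum in which the feature match and mel terms are skipped when their use flags are off. The function combine_generator_losses and the example loss values are hypothetical; the actual computation inside the class may differ.

```python
def combine_generator_losses(
    score2mel_loss: float,
    adv_loss: float,
    feat_match_loss: float,
    mel_loss: float,
    lambda_score2mel: float = 1.0,
    lambda_adv: float = 1.0,
    lambda_feat_match: float = 2.0,
    lambda_mel: float = 45.0,
    use_feat_match_loss: bool = True,
    use_mel_loss: bool = True,
) -> float:
    """Hypothetical weighted sum of generator-side losses.

    Default coefficients are the defaults from the class signature.
    """
    total = lambda_score2mel * score2mel_loss + lambda_adv * adv_loss
    if use_feat_match_loss:
        total += lambda_feat_match * feat_match_loss
    if use_mel_loss:
        total += lambda_mel * mel_loss
    return total


# Made-up loss values, only to show the effect of the weights.
total = combine_generator_losses(
    score2mel_loss=0.5, adv_loss=1.0, feat_match_loss=0.25, mel_loss=0.1
)
print(total)  # 6.5
```

With the default weights the mel term contributes 4.5 of the 6.5 total in this example, which shows how strongly lambda_mel = 45.0 emphasizes spectral reconstruction relative to the adversarial term.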
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, singing: Tensor, singing_lengths: Tensor, label: Dict[str, Tensor] | None = None, label_lengths: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: LongTensor | None = None, duration: Dict[str, Tensor] | None = None, slur: LongTensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, forward_generator: bool = True) → Dict[str, Any]
Perform generator or discriminator forward pass.
This method is responsible for executing either the generator or discriminator forward pass based on the provided forward_generator flag. It computes the loss and statistics required for training the model.
- Parameters:
- text (LongTensor) – Batch of padded character ids (B, Tmax).
- text_lengths (LongTensor) – Batch of lengths of each input sequence (B,).
- feats (Tensor) – Batch of padded target features (B, Lmax, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- singing (Tensor) – Singing waveform tensor (B, T_wav).
- singing_lengths (Tensor) – Singing length tensor (B,).
- label (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded label ids (B, Tmax).
- label_lengths (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of the lengths of padded label ids (B,).
- melody (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): Batch of padded melody (B, Tmax).
- pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
- duration (Optional[Dict]) – Key is “lab”, “score_phn” or “score_syb”; value (LongTensor): Batch of padded duration (B, Tmax).
- slur (FloatTensor) – Batch of padded slur (B, Tmax).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- forward_generator (bool) – Whether to run the generator pass (True) or the discriminator pass (False).
- Returns:
- loss (Tensor): Loss scalar tensor.
- stats (Dict[str, float]): Statistics to be monitored.
- weight (Tensor): Weight tensor to summarize losses.
- optim_idx (int): Optimizer index (0 for G and 1 for D).
- Return type: Dict[str, Any]
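The returned optim_idx tells the caller which optimizer the loss belongs to (0 for the generator, 1 for the discriminator). The sketch below shows one way a training loop could alternate the two passes and dispatch on optim_idx. StubGAN and StubOptimizer are hypothetical stand-ins so the example runs without ESPnet or PyTorch; the loss values are placeholders, and the real trainer may order or schedule the turns differently.

```python
from typing import Any, Dict


class StubGAN:
    """Stand-in for a GAN model; NOT the real JointScore2Wav."""

    def forward(
        self, batch: Dict[str, Any], forward_generator: bool = True
    ) -> Dict[str, Any]:
        # Placeholder losses; a real model computes them from the batch.
        loss = 1.0 if forward_generator else 0.5
        name = "generator_loss" if forward_generator else "discriminator_loss"
        return {
            "loss": loss,
            "stats": {name: loss},
            "weight": len(batch["text"]),            # batch size
            "optim_idx": 0 if forward_generator else 1,
        }


class StubOptimizer:
    """Stand-in optimizer that only counts its update steps."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.steps = 0

    def step(self) -> None:
        self.steps += 1


model = StubGAN()
optimizers = [StubOptimizer("generator"), StubOptimizer("discriminator")]
batch = {"text": [[1, 2, 3], [4, 5, 6]]}

for _ in range(3):  # three iterations over the same toy batch
    for forward_generator in (True, False):
        out = model.forward(batch, forward_generator=forward_generator)
        # Pick the optimizer that matches the pass that produced the loss.
        optimizers[out["optim_idx"]].step()

print([(o.name, o.steps) for o in optimizers])
# [('generator', 3), ('discriminator', 3)]
```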
Examples
>>> text = torch.randint(0, 100, (32, 50)) # Batch of text
>>> text_lengths = torch.randint(1, 50, (32,))
>>> feats = torch.rand(32, 100, 80) # Target features
>>> feats_lengths = torch.randint(1, 100, (32,))
>>> singing = torch.rand(32, 16000) # Singing waveform
>>> singing_lengths = torch.randint(1, 16000, (32,))
>>> output = model.forward(
... text, text_lengths, feats, feats_lengths,
... singing, singing_lengths, forward_generator=True
... )
>>> print(output.keys()) # Should print keys: loss, stats, weight, optim_idx
inference(text: Tensor, feats: Tensor | None = None, label: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, duration: Dict[str, Tensor] | None = None, slur: Dict[str, Tensor] | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, alpha: float = 1.0, max_len: int | None = None, use_teacher_forcing: bool = False) → Dict[str, Tensor]
Run inference with the jointly trained score2mel and vocoder models.
This class implements a joint training approach for a score-to-mel model and a vocoder. It supports various configurations for different models, allowing for flexibility in training and inference.
segment_size
Segment size for random windowed inputs.
- Type: int
use_pqmf
Whether to use PQMF for multi-band vocoder.
- Type: bool
generator
Dictionary containing the score2mel and vocoder models.
- Type: torch.nn.ModuleDict
discriminator
Discriminator model.
- Type: Discriminator
generator_adv_loss
Adversarial loss for the generator.
discriminator_adv_loss
Adversarial loss for the discriminator.
use_feat_match_loss
Flag indicating if feature match loss is used.
- Type: bool
feat_match_loss
Feature match loss module.
- Type: FeatureMatchLoss
use_mel_loss
Flag indicating if mel loss is used.
- Type: bool
mel_loss
Mel spectrogram loss module.
- Type: MelSpectrogramLoss
lambda_score2mel
Loss scaling coefficient for score2mel model loss.
- Type: float
lambda_adv
Loss scaling coefficient for adversarial loss.
- Type: float
lambda_feat_match
Loss scaling coefficient for feature match loss.
- Type: float
lambda_mel
Loss scaling coefficient for mel loss.
- Type: float
cache_generator_outputs
Whether to cache generator outputs.
- Type: bool
fs
Sampling rate for saving waveform during inference.
- Type: int
_cache
Cached outputs for generator during training.
- Type: Optional[Tuple]
Parameters:
- idim (int) – Input vocabulary size.
- odim (int) – Acoustic feature dimension.
- segment_size (int) – Segment size for random windowed inputs.
- sampling_rate (int) – Sampling rate for saving waveform.
- score2mel_type (str) – The score2mel model type.
- score2mel_params (Dict[str, Any]) – Parameter dict for score2mel model.
- vocoder_type (str) – The vocoder model type.
- vocoder_params (Dict[str, Any]) – Parameter dict for vocoder model.
- use_pqmf (bool) – Whether to use PQMF for multi-band vocoder.
- pqmf_params (Dict[str, Any]) – Parameter dict for PQMF module.
- discriminator_type (str) – Discriminator type.
- discriminator_params (Dict[str, Any]) – Parameter dict for discriminator.
- generator_adv_loss_params (Dict[str, Any]) – Parameter dict for generator adversarial loss.
- discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for discriminator adversarial loss.
- use_feat_match_loss (bool) – Whether to use feature match loss.
- feat_match_loss_params (Dict[str, Any]) – Parameter dict for feature match loss.
- use_mel_loss (bool) – Whether to use mel loss.
- mel_loss_params (Dict[str, Any]) – Parameter dict for mel loss.
- lambda_score2mel (float) – Loss scaling coefficient for score2mel model loss.
- lambda_adv (float) – Loss scaling coefficient for adversarial loss.
- lambda_feat_match (float) – Loss scaling coefficient for feature match loss.
- lambda_mel (float) – Loss scaling coefficient for mel loss.
- cache_generator_outputs (bool) – Whether to cache generator outputs.
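The cache_generator_outputs flag and the _cache attribute suggest that generator outputs computed in one turn can be reused in the following turn instead of being recomputed. The sketch below illustrates that idea under the assumption that a cached result is consumed once and then dropped. CachedGeneratorSketch and its doubling "generator" are hypothetical and are not the ESPnet implementation.

```python
from typing import Any, Dict, List, Optional, Tuple


class CachedGeneratorSketch:
    """Hypothetical illustration of output caching; not the ESPnet code."""

    def __init__(self, cache_generator_outputs: bool = False) -> None:
        self.cache_generator_outputs = cache_generator_outputs
        self._cache: Optional[Tuple[float, ...]] = None
        self.generator_calls = 0  # counts the expensive generator runs

    def _run_generator(self, inputs: List[float]) -> Tuple[float, ...]:
        # Placeholder for the expensive score2mel + vocoder computation.
        self.generator_calls += 1
        return tuple(2.0 * x for x in inputs)

    def _get_outputs(self, inputs: List[float]) -> Tuple[float, ...]:
        if self.cache_generator_outputs and self._cache is not None:
            outs = self._cache
            self._cache = None  # consume the cache so it is used only once
            return outs
        outs = self._run_generator(inputs)
        if self.cache_generator_outputs:
            self._cache = outs  # keep for the next turn
        return outs

    def forward(
        self, inputs: List[float], forward_generator: bool = True
    ) -> Dict[str, Any]:
        outs = self._get_outputs(inputs)
        return {"outs": outs, "optim_idx": 0 if forward_generator else 1}


cached = CachedGeneratorSketch(cache_generator_outputs=True)
cached.forward([1.0, 2.0], forward_generator=True)   # runs the generator
cached.forward([1.0, 2.0], forward_generator=False)  # reuses cached outputs
print(cached.generator_calls)  # 1

uncached = CachedGeneratorSketch(cache_generator_outputs=False)
uncached.forward([1.0, 2.0], forward_generator=True)
uncached.forward([1.0, 2.0], forward_generator=False)
print(uncached.generator_calls)  # 2
```

In this sketch, caching halves the number of generator runs per batch, at the cost that the discriminator turn sees outputs produced before the generator update.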
Examples
>>> model = JointScore2Wav(idim=256, odim=80)
>>> output = model.forward(text, text_lengths, feats, feats_lengths,
... singing, singing_lengths)
>>> inference_output = model.inference(text, feats)
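The fs attribute is documented as the sampling rate used when saving waveforms during inference. The sketch below shows one way to write a mono waveform at that rate using only the Python standard library. The sine wave is synthetic and stands in for a generated waveform, because the key under which inference returns the waveform is not shown in this section. The helper write_wav is hypothetical.

```python
import io
import math
import struct
import wave

fs = 22050  # default sampling_rate from the class signature

# Synthetic 0.1 s, 440 Hz sine in [-1, 1], standing in for model output.
samples = [0.5 * math.sin(2 * math.pi * 440 * n / fs) for n in range(fs // 10)]


def write_wav(file, samples, fs):
    """Hypothetical helper: write mono float samples as 16-bit PCM."""
    pcm = b"".join(
        # Clip to [-1, 1] before scaling to the int16 range.
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
        for s in samples
    )
    with wave.open(file, "wb") as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(fs)
        f.writeframes(pcm)


# An in-memory buffer keeps the example free of file-system side effects;
# a path such as "out.wav" would work the same way.
buffer = io.BytesIO()
write_wav(buffer, samples, fs)

buffer.seek(0)
with wave.open(buffer, "rb") as f:
    print(f.getframerate(), f.getnframes())  # 22050 2205
```

A tensor returned by the model would first need to be moved to the CPU and converted to a flat sequence of floats before being passed to a helper like this.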
property require_raw_singing
Return whether or not singing is required.
This property indicates whether the model needs the raw singing waveform as an input. For JointScore2Wav it is True: training requires the singing and singing_lengths arguments of forward, whereas inference takes no waveform input.
- Returns: True if raw singing is required, False otherwise.
- Return type: bool
Examples
>>> model = JointScore2Wav(...)
>>> model.require_raw_singing
True
property require_vocoder
Return whether or not a vocoder is required.
This class integrates the score-to-mel (score2mel) model and the vocoder in a single model. It is designed for end-to-end training of singing voice synthesis systems, so that both parts are optimized together and a waveform can be generated in one inference call.
segment_size
Size of segments for random windowed inputs.
- Type: int
use_pqmf
Flag indicating whether to use PQMF for multi-band vocoding.
- Type: bool
generator
Dictionary containing the score2mel and vocoder models.
- Type: torch.nn.ModuleDict
discriminator
Discriminator for adversarial training.
- Type: object
generator_adv_loss
Loss function for generator’s adversarial training.
- Type: object
discriminator_adv_loss
Loss function for discriminator’s adversarial training.
- Type: object
use_feat_match_loss
Flag indicating whether to use feature matching loss.
- Type: bool
feat_match_loss
Loss function for feature matching.
- Type: object
use_mel_loss
Flag indicating whether to use mel loss.
- Type: bool
mel_loss
Loss function for mel spectrogram matching.
- Type: object
lambda_score2mel
Coefficient for scaling score-to-mel loss.
- Type: float
lambda_adv
Coefficient for scaling adversarial loss.
- Type: float
lambda_feat_match
Coefficient for scaling feature match loss.
- Type: float
lambda_mel
Coefficient for scaling mel loss.
- Type: float
cache_generator_outputs
Flag indicating whether to cache generator outputs.
- Type: bool
fs
Sampling rate for saving waveforms during inference.
- Type: int
spks
List of speaker IDs for compatibility.
- Type: list
langs
List of language IDs for compatibility.
- Type: list
spk_embed_dim
Dimension of speaker embeddings.
- Type: int
Parameters:
- idim (int) – Input vocabulary size.
- odim (int) – Acoustic feature dimension. The model is end-to-end (score to waveform), so the actual number of output channels is 1; odim is kept for compatibility and indicates the acoustic feature dimension.
- segment_size (int) – Segment size for random windowed inputs.
- sampling_rate (int) – Sampling rate. Not used during training; it is referred to when saving waveforms during inference.
- score2mel_type (str) – The score2mel model type.
- score2mel_params (Dict[str, Any]) – Parameter dict for score2mel model.
- vocoder_type (str) – The vocoder model type.
- vocoder_params (Dict[str, Any]) – Parameter dict for vocoder model.
- use_pqmf (bool) – Whether to use PQMF for multi-band vocoder.
- pqmf_params (Dict[str, Any]) – Parameter dict for PQMF module.
- discriminator_type (str) – Discriminator type.
- discriminator_params (Dict[str, Any]) – Parameter dict for discriminator.
- generator_adv_loss_params (Dict[str, Any]) – Parameter dict for generator adversarial loss.
- discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for discriminator adversarial loss.
- use_feat_match_loss (bool) – Whether to use feature match loss.
- feat_match_loss_params (Dict[str, Any]) – Parameter dict for feature match loss.
- use_mel_loss (bool) – Whether to use mel loss.
- mel_loss_params (Dict[str, Any]) – Parameter dict for mel loss.
- lambda_score2mel (float) – Loss scaling coefficient for score2mel model loss.
- lambda_adv (float) – Loss scaling coefficient for adversarial loss.
- lambda_feat_match (float) – Loss scaling coefficient for feature match loss.
- lambda_mel (float) – Loss scaling coefficient for mel loss.
- cache_generator_outputs (bool) – Whether to cache generator outputs.
Examples
>>> model = JointScore2Wav(idim=100, odim=80)
>>> output = model.forward(text, text_lengths, feats, feats_lengths,
... singing, singing_lengths)
NOTE
This class requires PyTorch and the ESPnet2 framework (the espnet2.gan_svs package for GAN-based singing voice synthesis). Ensure that both are installed.