espnet2.gan_svs.vits.generator.VISingerGenerator

class espnet2.gan_svs.vits.generator.VISingerGenerator(vocabs: int, aux_channels: int = 513, hidden_channels: int = 192, spks: int | None = None, langs: int | None = None, spk_embed_dim: int | None = None, global_channels: int = -1, segment_size: int = 32, text_encoder_attention_heads: int = 2, text_encoder_ffn_expand: int = 4, text_encoder_blocks: int = 6, text_encoder_positionwise_layer_type: str = 'conv1d', text_encoder_positionwise_conv_kernel_size: int = 1, text_encoder_positional_encoding_layer_type: str = 'rel_pos', text_encoder_self_attention_layer_type: str = 'rel_selfattn', text_encoder_activation_type: str = 'swish', text_encoder_normalize_before: bool = True, text_encoder_dropout_rate: float = 0.1, text_encoder_positional_dropout_rate: float = 0.0, text_encoder_attention_dropout_rate: float = 0.0, text_encoder_conformer_kernel_size: int = 7, use_macaron_style_in_text_encoder: bool = True, use_conformer_conv_in_text_encoder: bool = True, decoder_kernel_size: int = 7, decoder_channels: int = 512, decoder_downsample_scales: List[int] = [2, 2, 8, 8], decoder_downsample_kernel_sizes: List[int] = [4, 4, 16, 16], decoder_upsample_scales: List[int] = [8, 8, 2, 2], decoder_upsample_kernel_sizes: List[int] = [16, 16, 4, 4], decoder_resblock_kernel_sizes: List[int] = [3, 7, 11], decoder_resblock_dilations: List[List[int]] = [[1, 3, 5], [1, 3, 5], [1, 3, 5]], use_avocodo=False, projection_filters: List[int] = [0, 1, 1, 1], projection_kernels: List[int] = [0, 5, 7, 11], n_harmonic: int = 64, use_weight_norm_in_decoder: bool = True, posterior_encoder_kernel_size: int = 5, posterior_encoder_layers: int = 16, posterior_encoder_stacks: int = 1, posterior_encoder_base_dilation: int = 1, posterior_encoder_dropout_rate: float = 0.0, use_weight_norm_in_posterior_encoder: bool = True, flow_flows: int = 4, flow_kernel_size: int = 5, flow_base_dilation: int = 1, flow_layers: int = 4, flow_dropout_rate: float = 0.0, use_weight_norm_in_flow: bool = True, use_only_mean_in_flow: bool = True, generator_type: str = 'visinger', vocoder_generator_type: str = 'hifigan', fs: int = 22050, hop_length: int = 256, win_length: int | None = 1024, n_fft: int = 1024, use_phoneme_predictor: bool = False, expand_f0_method: str = 'repeat', hubert_channels: int | None = 0)

Bases: Module

Generator module in VISinger.

This module implements the VISinger generator for singing voice synthesis, as described in "VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis".

This generator can be configured with various parameters for the text encoder, decoder, and additional features such as speaker and language embeddings.

aux_channels

Number of acoustic feature channels.

Type: int

hidden_channels

Number of hidden channels.

Type: int

generator_type

Type of generator to use for the model.

Type: str

segment_size

Segment size for the decoder.

Type: int

sample_rate

Sample rate of the audio.

Type: int

hop_length

Number of samples between successive frames in STFT.

Type: int

use_avocodo

Whether to use Avocodo model in the generator.

Type: bool

use_flow

Whether to use flow in the generator.

Type: bool

use_phoneme_predictor

Whether to use phoneme predictor in the model.

Type: bool

text_encoder

The text encoder module.

Type: TextEncoder

decoder

The vocoder generator module.

Type: Union[UHiFiGANGenerator, HiFiGANGenerator, AvocodoGenerator, VISinger2VocoderGenerator]

posterior_encoder

The posterior encoder module.

Type: PosteriorEncoder

flow

The flow module, if used.

Type: Optional[ResidualAffineCouplingBlock]

duration_predictor

The duration predictor module.

Type: DurationPredictor

The length regulator module.

Type: LengthRegulator

phoneme_predictor

The phoneme predictor.

Type: Optional[PhonemePredictor]

f0_decoder

The pitch decoder module.

Type: Decoder

prior_decoder

The prior decoder module.

Type: PriorDecoder
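
The length regulator listed above expands token-level representations to frame level according to per-token durations. The following is a minimal pure-Python sketch of that idea, not ESPnet's LengthRegulator; the helper name `regulate_length` and the toy inputs are hypothetical.

```python
from typing import List, Sequence


def regulate_length(
    hidden: Sequence[Sequence[float]], durations: Sequence[int]
) -> List[List[float]]:
    """Repeat the i-th token vector durations[i] times (hypothetical helper)."""
    if len(hidden) != len(durations):
        raise ValueError("hidden and durations must have the same length")
    frames: List[List[float]] = []
    for vector, dur in zip(hidden, durations):
        # A duration of 0 drops the token from the frame-level sequence.
        frames.extend(list(vector) for _ in range(dur))
    return frames


# Three tokens with two hidden channels each, expanded to 2 + 0 + 3 frames.
token_hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
frame_hidden = regulate_length(token_hidden, [2, 0, 3])
print(len(frame_hidden))  # 5
```

The output length equals the sum of the durations, which is why the duration tensors and the text-side tensors share the same Tmax dimension.
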
Parameters:
- vocabs (int) – Input vocabulary size.
- aux_channels (int) – Number of acoustic feature channels.
- hidden_channels (int) – Number of hidden channels.
- spks (Optional[int]) – Number of speakers.
- langs (Optional[int]) – Number of languages.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension.
- global_channels (int) – Number of global conditioning channels.
- segment_size (int) – Segment size for decoder.
- text_encoder_attention_heads (int) – Number of heads in text encoder.
- text_encoder_ffn_expand (int) – Expansion ratio of FFN in text encoder.
- text_encoder_blocks (int) – Number of blocks in text encoder.
- text_encoder_positionwise_layer_type (str) – Position-wise layer type in text encoder.
- text_encoder_positionwise_conv_kernel_size (int) – Convolution kernel size in text encoder.
- text_encoder_positional_encoding_layer_type (str) – Positional encoding layer type.
- text_encoder_self_attention_layer_type (str) – Self-attention layer type.
- text_encoder_activation_type (str) – Activation function type in text encoder.
- text_encoder_normalize_before (bool) – Normalize before self-attention in text encoder.
- text_encoder_dropout_rate (float) – Dropout rate in text encoder.
- text_encoder_positional_dropout_rate (float) – Positional dropout rate in text encoder.
- text_encoder_attention_dropout_rate (float) – Attention dropout rate in text encoder.
- text_encoder_conformer_kernel_size (int) – Conformer kernel size in text encoder.
- use_macaron_style_in_text_encoder (bool) – Use macaron style FFN in text encoder.
- use_conformer_conv_in_text_encoder (bool) – Use convolution in text encoder.
- decoder_kernel_size (int) – Decoder kernel size.
- decoder_channels (int) – Number of decoder initial channels.
- decoder_downsample_scales (List[int]) – List of downsampling scales in decoder.
- decoder_downsample_kernel_sizes (List[int]) – List of kernel sizes for downsampling layers.
- decoder_upsample_scales (List[int]) – List of upsampling scales in decoder.
- decoder_upsample_kernel_sizes (List[int]) – List of kernel sizes for upsampling layers.
- decoder_resblock_kernel_sizes (List[int]) – List of kernel sizes for resblocks in decoder.
- decoder_resblock_dilations (List[List[int]]) – List of dilations for resblocks in decoder.
- use_avocodo (bool) – Whether to use Avocodo model in the generator.
- projection_filters (List[int]) – List of projection filter sizes.
- projection_kernels (List[int]) – List of projection kernel sizes.
- n_harmonic (int) – Number of harmonic components.
- use_weight_norm_in_decoder (bool) – Apply weight normalization in decoder.
- posterior_encoder_kernel_size (int) – Posterior encoder kernel size.
- posterior_encoder_layers (int) – Number of layers in posterior encoder.
- posterior_encoder_stacks (int) – Number of stacks in posterior encoder.
- posterior_encoder_base_dilation (int) – Base dilation in posterior encoder.
- posterior_encoder_dropout_rate (float) – Dropout rate in posterior encoder.
- use_weight_norm_in_posterior_encoder (bool) – Apply weight normalization in posterior encoder.
- flow_flows (int) – Number of flows in flow.
- flow_kernel_size (int) – Kernel size in flow.
- flow_base_dilation (int) – Base dilation in flow.
- flow_layers (int) – Number of layers in flow.
- flow_dropout_rate (float) – Dropout rate in flow.
- use_weight_norm_in_flow (bool) – Apply weight normalization in flow.
- use_only_mean_in_flow (bool) – Use only mean in flow.
- generator_type (str) – Type of generator to use for the model.
- vocoder_generator_type (str) – Type of vocoder generator to use for the model.
- fs (int) – Sample rate of the audio.
- hop_length (int) – Number of samples between successive frames in STFT.
- win_length (Optional[int]) – Window size of the STFT.
- n_fft (int) – Length of the FFT window.
- use_phoneme_predictor (bool) – Whether to use phoneme predictor in the model.
- expand_f0_method (str) – Method used to expand F0. Use “repeat” or “interpolation”.
- hubert_channels (Union[int, None]) – Number of channels in the Hubert model.
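
Several of the defaults above are tied together numerically. The sketch below checks those relationships using only values copied from the constructor signature. It assumes that the vocoder's total upsampling factor is the product of decoder_upsample_scales and that the acoustic features are one-sided linear spectrograms; neither assumption is stated on this page.

```python
from math import prod

# Default values copied from the constructor signature.
decoder_upsample_scales = [8, 8, 2, 2]
hop_length = 256
n_fft = 1024
aux_channels = 513
segment_size = 32

# Assumption: total upsampling factor is the product of the per-layer scales.
upsample_factor = prod(decoder_upsample_scales)

# Each acoustic frame should expand to exactly hop_length waveform samples.
assert upsample_factor == hop_length

# Assumption: features are one-sided spectra, which have n_fft // 2 + 1 bins.
assert aux_channels == n_fft // 2 + 1

# Waveform samples generated from one training segment of segment_size frames.
segment_samples = segment_size * upsample_factor
print(upsample_factor, segment_samples)  # 256 8192
```

When changing hop_length or n_fft, the same checks are a quick way to see whether decoder_upsample_scales and aux_channels need to change with them.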

Examples

>>> import torch
>>> from espnet2.gan_svs.vits.generator import VISingerGenerator
>>> generator = VISingerGenerator(vocabs=100, aux_channels=513)
>>> text = torch.randint(0, 100, (8, 50))
>>> text_lengths = torch.tensor([50] * 8)
>>> feats = torch.randn(8, 100, 513)
>>> feats_lengths = torch.tensor([100] * 8)
>>> output = generator(text, text_lengths, feats, feats_lengths)

NOTE

This implementation is based on the VITS architecture and adds singing-specific modules: a duration predictor, a pitch (F0) decoder, a prior decoder, and an optional phoneme predictor.

Raises: ValueError – If an unsupported vocoder generator type is provided.

Initialize VITS generator module.

Parameters:
- vocabs (int) – Input vocabulary size.
- aux_channels (int) – Number of acoustic feature channels.
- hidden_channels (int) – Number of hidden channels.
- spks (Optional[int]) – Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer.
- langs (Optional[int]) – Number of languages. If set to > 1, assume that the lids will be provided as the input and use lid embedding layer.
- spk_embed_dim (Optional[int]) – Speaker embedding dimension. If set to > 0, assume that spembs will be provided as the input.
- global_channels (int) – Number of global conditioning channels.
- segment_size (int) – Segment size for decoder.
- text_encoder_attention_heads (int) – Number of heads in conformer block of text encoder.
- text_encoder_ffn_expand (int) – Expansion ratio of FFN in conformer block of text encoder.
- text_encoder_blocks (int) – Number of conformer blocks in text encoder.
- text_encoder_positionwise_layer_type (str) – Position-wise layer type in conformer block of text encoder.
- text_encoder_positionwise_conv_kernel_size (int) – Position-wise convolution kernel size in conformer block of text encoder. Only used when the above layer type is conv1d or conv1d-linear.
- text_encoder_positional_encoding_layer_type (str) – Positional encoding layer type in conformer block of text encoder.
- text_encoder_self_attention_layer_type (str) – Self-attention layer type in conformer block of text encoder.
- text_encoder_activation_type (str) – Activation function type in conformer block of text encoder.
- text_encoder_normalize_before (bool) – Whether to apply layer norm before self-attention in conformer block of text encoder.
- text_encoder_dropout_rate (float) – Dropout rate in conformer block of text encoder.
- text_encoder_positional_dropout_rate (float) – Dropout rate for positional encoding in conformer block of text encoder.
- text_encoder_attention_dropout_rate (float) – Dropout rate for attention in conformer block of text encoder.
- text_encoder_conformer_kernel_size (int) – Conformer conv kernel size. Only used when use_conformer_conv_in_text_encoder = True.
- use_macaron_style_in_text_encoder (bool) – Whether to use macaron style FFN in conformer block of text encoder.
- use_conformer_conv_in_text_encoder (bool) – Whether to use convolution in conformer block of text encoder.
- decoder_kernel_size (int) – Decoder kernel size.
- decoder_channels (int) – Number of decoder initial channels.
- decoder_downsample_scales (List[int]) – List of downsampling scales in decoder.
- decoder_downsample_kernel_sizes (List[int]) – List of kernel sizes for downsampling layers in decoder.
- decoder_upsample_scales (List[int]) – List of upsampling scales in decoder.
- decoder_upsample_kernel_sizes (List[int]) – List of kernel sizes for upsampling layers in decoder.
- decoder_resblock_kernel_sizes (List[int]) – List of kernel sizes for resblocks in decoder.
- decoder_resblock_dilations (List[List[int]]) – List of list of dilations for resblocks in decoder.
- use_avocodo (bool) – Whether to use Avocodo model in the generator.
- projection_filters (List[int]) – List of projection filter sizes.
- projection_kernels (List[int]) – List of projection kernel sizes.
- n_harmonic (int) – Number of harmonic components.
- use_weight_norm_in_decoder (bool) – Whether to apply weight normalization in decoder.
- posterior_encoder_kernel_size (int) – Posterior encoder kernel size.
- posterior_encoder_layers (int) – Number of layers of posterior encoder.
- posterior_encoder_stacks (int) – Number of stacks of posterior encoder.
- posterior_encoder_base_dilation (int) – Base dilation of posterior encoder.
- posterior_encoder_dropout_rate (float) – Dropout rate for posterior encoder.
- use_weight_norm_in_posterior_encoder (bool) – Whether to apply weight normalization in posterior encoder.
- flow_flows (int) – Number of flows in flow.
- flow_kernel_size (int) – Kernel size in flow.
- flow_base_dilation (int) – Base dilation in flow.
- flow_layers (int) – Number of layers in flow.
- flow_dropout_rate (float) – Dropout rate in flow.
- use_weight_norm_in_flow (bool) – Whether to apply weight normalization in flow.
- use_only_mean_in_flow (bool) – Whether to use only mean in flow.
- generator_type (str) – Type of generator to use for the model.
- vocoder_generator_type (str) – Type of vocoder generator to use for the model.
- fs (int) – Sample rate of the audio.
- hop_length (int) – Number of samples between successive frames in STFT.
- win_length (Optional[int]) – Window size of the STFT.
- n_fft (int) – Length of the FFT window to be used.
- use_phoneme_predictor (bool) – Whether to use phoneme predictor in the model.
- expand_f0_method (str) – The method used to expand F0. Use “repeat” or “interpolation”.
- hubert_channels (Optional[int]) – Number of channels in the Hubert model. This is used in VISinger2 Plus.
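
The expand_f0_method parameter chooses between "repeat" and "interpolation" when frame-level F0 is stretched to sample level. The sketch below illustrates one plausible reading of those two options in pure Python. The function names are hypothetical, and ESPnet's own implementation may differ in details such as edge handling or the treatment of unvoiced frames.

```python
from typing import List


def expand_f0_repeat(f0: List[float], hop_length: int) -> List[float]:
    """Hold each frame-level F0 value for hop_length samples."""
    return [value for value in f0 for _ in range(hop_length)]


def expand_f0_interpolation(f0: List[float], hop_length: int) -> List[float]:
    """Linearly interpolate towards the next frame; the last frame is held."""
    out: List[float] = []
    for i, value in enumerate(f0):
        nxt = f0[i + 1] if i + 1 < len(f0) else value
        for k in range(hop_length):
            out.append(value + (nxt - value) * k / hop_length)
    return out


f0 = [100.0, 200.0]
print(expand_f0_repeat(f0, 4))
# [100.0, 100.0, 100.0, 100.0, 200.0, 200.0, 200.0, 200.0]
print(expand_f0_interpolation(f0, 4))
# [100.0, 125.0, 150.0, 175.0, 200.0, 200.0, 200.0, 200.0]
```

Both variants return len(f0) * hop_length values. Repetition produces a staircase contour, while interpolation gives a smoother one.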

Calculate forward propagation.

Parameters:
- text (LongTensor) – Batch of padded character ids (B, Tmax).
- text_lengths (LongTensor) – Batch of lengths of each input sequence (B,).
- feats (Tensor) – Batch of padded target features (B, Lmax, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- label (LongTensor) – Batch of padded label ids (B, Tmax).
- label_lengths (LongTensor) – Batch of the lengths of padded label ids (B, ).
- melody (LongTensor) – Batch of padded midi (B, Tmax).
- gt_dur (LongTensor) – Batch of padded ground truth duration (B, Tmax).
- score_dur (LongTensor) – Batch of padded score duration (B, Tmax).
- pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
- ying (Optional[Tensor]) – Batch of padded ying (B, Tmax).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
Returns:
- Tensor: Waveform tensor (B, 1, segment_size * upsample_factor).
- Tensor: Duration negative log-likelihood (NLL) tensor (B,).
- Tensor: Monotonic attention weight tensor (B, 1, T_feats, T_text).
- Tensor: Segments start index tensor (B,).
- Tensor: Text mask tensor (B, 1, T_text).
- Tensor: Feature mask tensor (B, 1, T_feats).
- tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor]:
  - Tensor: Posterior encoder hidden representation (B, H, T_feats).
  - Tensor: Flow hidden representation (B, H, T_feats).
  - Tensor: Expanded text encoder projected mean (B, H, T_feats).
  - Tensor: Expanded text encoder projected scale (B, H, T_feats).
  - Tensor: Posterior encoder projected mean (B, H, T_feats).
  - Tensor: Posterior encoder projected scale (B, H, T_feats).
Return type: Tensor
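
The forward pass returns a waveform covering only segment_size frames, together with the start index of each segment. The sketch below shows how such a random segment might be chosen and how a frame index maps to waveform samples. The helper `pick_random_segment` is hypothetical and stands in for whatever ESPnet does internally; it also assumes upsample_factor equals hop_length.

```python
import random
from typing import List, Tuple


def pick_random_segment(
    feats: List[List[float]], feats_length: int, segment_size: int, rng: random.Random
) -> Tuple[List[List[float]], int]:
    """Return a window of segment_size frames and its start index (hypothetical helper)."""
    # Only frames before feats_length are valid; the rest is padding.
    max_start = max(feats_length - segment_size, 0)
    start = rng.randint(0, max_start)  # inclusive on both ends
    return feats[start : start + segment_size], start


segment_size, hop_length = 32, 256  # defaults from the constructor signature
feats = [[float(t)] for t in range(100)]  # 100 frames, 1 channel
segment, start = pick_random_segment(feats, 100, segment_size, random.Random(0))

# Matching waveform span, assuming upsample_factor == hop_length.
wav_start = start * hop_length
wav_end = (start + segment_size) * hop_length
print(len(segment), wav_end - wav_start)  # 32 8192
```

The returned start indices let the training loop cut the matching span from the ground-truth waveform before computing losses.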

Run inference.

Parameters:
- text (LongTensor) – Batch of padded character ids (B, Tmax).
- text_lengths (LongTensor) – Batch of lengths of each input batch (B,).
- feats (Tensor) – Batch of padded target features (B, Lmax, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- label (LongTensor) – Batch of padded label ids (B, Tmax).
- label_lengths (LongTensor) – Batch of the lengths of padded label ids (B, ).
- melody (LongTensor) – Batch of padded midi (B, Tmax).
- gt_dur (LongTensor) – Batch of padded ground truth duration (B, Tmax).
- score_dur (LongTensor) – Batch of padded score duration (B, Tmax).
- pitch (FloatTensor) – Batch of padded f0 (B, Tmax).
- ying (Optional[Tensor]) – Batch of padded ying (B, Tmax).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional[Tensor]) – Batch of speaker IDs (B, 1).
- lids (Optional[Tensor]) – Batch of language IDs (B, 1).
- noise_scale (float) – Noise scale parameter for flow.
- noise_scale_dur (float) – Noise scale parameter for duration predictor.
- alpha (float) – Alpha parameter to control the speed of generated speech.
- max_len (Optional[int]) – Maximum length of acoustic feature sequence.
- use_teacher_forcing (bool) – Whether to use teacher forcing.
Returns: Generated waveform tensor (B, T_wav).
Return type: Tensor
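
The noise_scale argument controls how much randomness is injected when the latent is sampled from the prior. The sketch below shows the usual VITS-style form, z = mean + noise * exp(log_scale) * noise_scale, in pure Python. It assumes VISinger's inference follows the same form, and the function name `sample_latent` is hypothetical.

```python
import math
import random
from typing import List


def sample_latent(
    mean: List[float], log_scale: List[float], noise_scale: float, rng: random.Random
) -> List[float]:
    """Sample z = mean + eps * exp(log_scale) * noise_scale with eps ~ N(0, 1)."""
    return [
        m + rng.gauss(0.0, 1.0) * math.exp(s) * noise_scale
        for m, s in zip(mean, log_scale)
    ]


rng = random.Random(0)
mean = [0.5, -1.0, 2.0]
log_scale = [0.0, -0.5, 0.3]

# noise_scale = 0 removes the random term, so the sample equals the prior mean.
print(sample_latent(mean, log_scale, 0.0, rng))  # [0.5, -1.0, 2.0]

# Larger noise_scale values spread samples further from the mean.
print(sample_latent(mean, log_scale, 0.667, rng))
```

Lower values make the output more deterministic, and higher values increase variation between runs.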