espnet2.gan_svs.vits.vits.VITS
class espnet2.gan_svs.vits.vits.VITS(idim: int, odim: int, sampling_rate: int = 22050, generator_type: str = 'visinger', vocoder_generator_type: str = 'hifigan', generator_params: Dict[str, Any] = {'decoder_channels': 512, 'decoder_kernel_size': 7, 'decoder_resblock_dilations': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'decoder_resblock_kernel_sizes': [3, 7, 11], 'decoder_upsample_kernel_sizes': [16, 16, 4, 4], 'decoder_upsample_scales': [8, 8, 2, 2], 'expand_f0_method': 'repeat', 'flow_base_dilation': 1, 'flow_dropout_rate': 0.0, 'flow_flows': 4, 'flow_kernel_size': 5, 'flow_layers': 4, 'global_channels': -1, 'hidden_channels': 192, 'hubert_channels': 0, 'langs': None, 'posterior_encoder_base_dilation': 1, 'posterior_encoder_dropout_rate': 0.0, 'posterior_encoder_kernel_size': 5, 'posterior_encoder_layers': 16, 'posterior_encoder_stacks': 1, 'projection_filters': [0, 1, 1, 1], 'projection_kernels': [0, 5, 7, 11], 'segment_size': 32, 'spk_embed_dim': None, 'spks': None, 'text_encoder_activation_type': 'swish', 'text_encoder_attention_dropout_rate': 0.0, 'text_encoder_attention_heads': 2, 'text_encoder_blocks': 6, 'text_encoder_conformer_kernel_size': 7, 'text_encoder_dropout_rate': 0.1, 'text_encoder_ffn_expand': 4, 'text_encoder_normalize_before': True, 'text_encoder_positional_dropout_rate': 0.0, 'text_encoder_positional_encoding_layer_type': 'rel_pos', 'text_encoder_positionwise_conv_kernel_size': 1, 'text_encoder_positionwise_layer_type': 'conv1d', 'text_encoder_self_attention_layer_type': 'rel_selfattn', 'use_conformer_conv_in_text_encoder': True, 'use_macaron_style_in_text_encoder': True, 'use_only_mean_in_flow': True, 'use_phoneme_predictor': False, 'use_weight_norm_in_decoder': True, 'use_weight_norm_in_flow': True, 'use_weight_norm_in_posterior_encoder': True}, discriminator_type: str = 'hifigan_multi_scale_multi_period_discriminator', discriminator_params: Dict[str, Any] = {'avocodo': {'combd': {'combd_d_d': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]], 'combd_d_g': [[1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1]], 'combd_d_k': [[7, 11, 11, 11, 11, 5], [11, 21, 21, 21, 21, 5], [15, 41, 41, 41, 41, 5]], 'combd_d_p': [[3, 5, 5, 5, 5, 2], [5, 10, 10, 10, 10, 2], [7, 20, 20, 20, 20, 2]], 'combd_d_s': [[1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1]], 'combd_h_u': [[16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024]], 'combd_op_f': [1, 1, 1], 'combd_op_g': [1, 1, 1], 'combd_op_k': [3, 3, 3]}, 'pqmf_config': {'lv1': [2, 256, 0.25, 10.0], 'lv2': [4, 192, 0.13, 10.0]}, 'sbd': {'pqmf_config': {'fsbd': [64, 256, 0.1, 9.0], 'sbd': [16, 256, 0.03, 10.0]}, 'sbd_band_ranges': [[0, 6], [0, 11], [0, 16], [0, 64]], 'sbd_dilations': [[[5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11]], [[3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [2, 3, 5], [2, 3, 5]]], 'sbd_filters': [[64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [32, 64, 128, 128, 128]], 'sbd_kernel_sizes': [[[7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]], [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]]], 'sbd_strides': [[1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1]], 'sbd_transpose': [False, False, False, True], 'use_sbd': True}}, 
'hifigan_multi_scale_multi_period_discriminator': {'follow_official_norm': False, 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'scale_discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scales': 1}}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 22050, 'hop_length': 256, 'log_base': None, 'n_fft': 1024, 'n_mels': 80, 'win_length': None, 'window': 'hann'}, lambda_adv: float = 1.0, lambda_mel: float = 45.0, lambda_feat_match: float = 2.0, lambda_dur: float = 0.1, lambda_kl: float = 1.0, lambda_pitch: float = 10.0, lambda_phoneme: float = 1.0, lambda_c_yin: float = 45.0, cache_generator_outputs: bool = True)
Bases: AbsGANSVS
VITS module (generator + discriminator).
This is a VITS module as described in *Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech*.
This class integrates the generator and the discriminator of the VITS model, enabling high-quality singing voice synthesis through adversarial training.
generator
The generator model for synthesizing audio.
- Type: VISingerGenerator
discriminator
The discriminator model for evaluating the generated audio.
- Type: Discriminator
lambda_adv
Coefficient for the adversarial loss.
- Type: float
lambda_mel
Coefficient for the mel spectrogram loss.
- Type: float
lambda_feat_match
Coefficient for the feature matching loss.
- Type: float
lambda_dur
Coefficient for duration loss.
- Type: float
lambda_kl
Coefficient for KL divergence loss.
- Type: float
lambda_pitch
Coefficient for pitch loss.
- Type: float
lambda_phoneme
Coefficient for phoneme loss.
- Type: float
lambda_c_yin
Coefficient for yin loss.
- Type: float
fs
Sampling rate for saving waveform during inference.
- Type: int
use_flow
Indicates whether to use flow in the generator.
- Type: bool
use_phoneme_predictor
Indicates whether to use a phoneme predictor.
- Type: bool
use_avocodo
Indicates whether to use Avocodo in the model.
- Type: bool
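For orientation, the coefficients above weight the individual terms of the total generator loss. The exact bookkeeping lives in forward(); the following is only a hedged sketch with placeholder loss variables:
loss_g = (
    lambda_adv * adv_loss
    + lambda_mel * mel_loss
    + lambda_feat_match * feat_match_loss
    + lambda_dur * dur_loss
    + lambda_kl * kl_loss
    + lambda_pitch * pitch_loss
    + lambda_phoneme * phoneme_loss
    + lambda_c_yin * yin_loss
)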
Parameters:
- idim (int) – Input vocabulary size.
- odim (int) – Acoustic feature dimension. The actual number of output channels is 1, since VITS is an end-to-end text-to-wave model, but odim is kept to indicate the acoustic feature dimension for compatibility.
- sampling_rate (int) – Sampling rate. Not used for training, but referenced when saving waveforms during inference.
- generator_type (str) – Generator type.
- vocoder_generator_type (str) – Type of vocoder generator to use in the model.
- generator_params (Dict[str, Any]) – Parameter dict for the generator.
- discriminator_type (str) – Discriminator type.
- discriminator_params (Dict[str, Any]) – Parameter dict for the discriminator.
- generator_adv_loss_params (Dict[str, Any]) – Parameter dict for the generator adversarial loss.
- discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for the discriminator adversarial loss.
- feat_match_loss_params (Dict[str, Any]) – Parameter dict for the feature matching loss.
- mel_loss_params (Dict[str, Any]) – Parameter dict for the mel loss.
- lambda_adv (float) – Loss scaling coefficient for adversarial loss.
- lambda_mel (float) – Loss scaling coefficient for mel spectrogram loss.
- lambda_feat_match (float) – Loss scaling coefficient for the feature matching loss.
- lambda_dur (float) – Loss scaling coefficient for duration loss.
- lambda_kl (float) – Loss scaling coefficient for KL divergence loss.
- lambda_pitch (float) – Loss scaling coefficient for pitch loss.
- lambda_phoneme (float) – Loss scaling coefficient for phoneme loss.
- lambda_c_yin (float) – Loss scaling coefficient for yin loss.
- cache_generator_outputs (bool) – Whether to cache generator outputs.
######### Examples
>>> # Initialize the VITS model
>>> vits_model = VITS(idim=100, odim=80)
>>> # Perform inference (text_tensor and feats_tensor are placeholders)
>>> output = vits_model.inference(text_tensor, feats_tensor)
>>> generated_waveform = output["wav"]
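The generator_params default is a large dict, and passing a partial dict replaces it wholesale at this level (the underlying generator class may still supply its own defaults). A hedged way to tweak a single entry is to copy the signature default and edit it; the hidden_channels override below is purely illustrative:
>>> import copy, inspect
>>> defaults = inspect.signature(VITS.__init__).parameters["generator_params"].default
>>> gen_params = copy.deepcopy(defaults)
>>> gen_params["hidden_channels"] = 256
>>> vits_model = VITS(idim=100, odim=80, generator_params=gen_params)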
NOTE
This implementation requires the PyTorch library and assumes that the user has a compatible environment set up for GAN training.
- Raises: ValueError – If the generator or discriminator type is not available.
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, singing: Tensor, singing_lengths: Tensor, ssl_feats: Tensor | None = None, ssl_feats_lengths: Tensor | None = None, label: Dict[str, Tensor] | None = None, label_lengths: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: LongTensor | None = None, ying: Tensor | None = None, duration: Dict[str, Tensor] | None = None, slur: LongTensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, forward_generator: bool = True) → Dict[str, Any]
Perform generator forward.
This method takes the input text and related features and runs either the generator or the discriminator pass of the VITS model, depending on the forward_generator flag.
- Parameters:
- text (LongTensor) – Batch of padded character ids (B, T_text).
- text_lengths (LongTensor) – Batch of lengths of each input text (B,).
- feats (Tensor) – Batch of padded target features (B, T_feats, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- singing (Tensor) – Singing waveform tensor (B, T_wav).
- singing_lengths (Tensor) – Singing length tensor (B,).
- ssl_feats (Tensor) – SSL feature tensor (B, T_feats, hubert_channels).
- ssl_feats_lengths (Tensor) – SSL feature length tensor (B,).
- label (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): batch of padded label ids (B, T_text).
- label_lengths (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): batch of the lengths of padded label ids (B,).
- melody (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): batch of padded melody (B, T_text).
- pitch (FloatTensor) – Batch of padded f0 (B, T_feats).
- ying (Optional[Tensor]) – Batch of padded ying (B, T_feats).
- duration (Optional[Dict]) – Key is “lab”, “score_phn” or “score_syb”; value (LongTensor): batch of padded duration (B, T_text).
- slur (LongTensor) – Batch of padded slur (B, T_text).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional *[*Tensor ]) – Batch of speaker IDs (B, 1).
- lids (Optional *[*Tensor ]) – Batch of language IDs (B, 1).
- forward_generator (bool) – Whether to forward generator.
- Returns:
- loss (Tensor): Loss scalar tensor.
- stats (Dict[str, float]): Statistics to be monitored.
- weight (Tensor): Weight tensor to summarize losses.
- optim_idx (int): Optimizer index (0 for G and 1 for D).
- Return type: Dict[str, Any]
######### Examples
>>> model = VITS(...)
>>> output = model.forward(
... text=text_tensor,
... text_lengths=text_lengths_tensor,
... feats=feats_tensor,
... feats_lengths=feats_lengths_tensor,
... singing=singing_tensor,
... singing_lengths=singing_lengths_tensor,
... forward_generator=True
... )
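A hedged sketch of how the two passes combine into one GAN training iteration; the batch tensors and the two optimizers are placeholders, and in practice ESPnet's GAN trainer drives this loop rather than user code:
>>> optimizers = [gen_optimizer, disc_optimizer]  # placeholder optimizers
>>> for forward_generator in (True, False):
...     output = model.forward(
...         text=text_tensor,
...         text_lengths=text_lengths_tensor,
...         feats=feats_tensor,
...         feats_lengths=feats_lengths_tensor,
...         singing=singing_tensor,
...         singing_lengths=singing_lengths_tensor,
...         forward_generator=forward_generator,
...     )
...     optimizer = optimizers[output["optim_idx"]]  # 0 for G, 1 for D
...     optimizer.zero_grad()
...     output["loss"].backward()
...     optimizer.step()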
inference(text: Tensor, feats: Tensor | None = None, ssl_feats: Tensor | None = None, label: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, duration: Dict[str, Tensor] | None = None, slur: Dict[str, Tensor] | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, alpha: float = 1.0, max_len: int | None = None, use_teacher_forcing: bool = False) → Dict[str, Tensor]
Run inference to generate a waveform from input text and features.
This method processes the input text, features, and optional parameters to produce a generated waveform using the VITS model.
- Parameters:
- text (Tensor) – Input text index tensor (T_text,).
- feats (Tensor) – Feature tensor (T_feats, aux_channels).
- ssl_feats (Tensor) – SSL Feature tensor (T_feats, hubert_channels).
- label (Optional[Dict]) – Dictionary containing label data. Keys can be “lab” or “score”; values are LongTensors representing padded label ids (B, T_text).
- melody (Optional[Dict]) – Dictionary containing melody data. Keys can be “lab” or “score”; values are LongTensors representing padded melody (B, T_text).
- pitch (FloatTensor) – Batch of padded f0 (B, T_feats).
- duration (Optional[Dict]) – Dictionary containing duration data. Keys can be “lab”, “score_phn” or “score_syb”; values are LongTensors representing padded duration (B, T_text).
- slur (LongTensor) – Batch of padded slur (B, T_text).
- spembs (Optional[Tensor]) – Speaker embedding tensor (spk_embed_dim,).
- sids (Tensor) – Speaker index tensor (1,).
- lids (Tensor) – Language index tensor (1,).
- noise_scale (float) – Noise scale value for the flow (default: 0.667).
- noise_scale_dur (float) – Noise scale value for the duration predictor (default: 0.8).
- alpha (float) – Alpha parameter to control the speed of the generated singing (default: 1.0).
- max_len (Optional[int]) – Maximum length of the output (default: None).
- use_teacher_forcing (bool) – Whether to use teacher forcing during inference (default: False).
- Returns: A dictionary containing the generated waveform tensor (T_wav,).
- Return type: Dict[str, Tensor]
######### Examples
>>> model = VITS(...)
>>> text = torch.tensor([1, 2, 3, 4]) # Example text input
>>> feats = torch.randn(10, 80) # Example features
>>> generated = model.inference(text, feats)
>>> waveform = generated['wav'] # Access the generated waveform
NOTE
Ensure that the input text and features are appropriately padded and shaped for the model to process correctly.
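For score-conditioned singing voice synthesis, label, melody, and duration are passed as dictionaries keyed as documented above. A hedged sketch; every tensor content and length below is a placeholder:
>>> import torch
>>> label = {"score": torch.randint(0, 100, (1, 5))}
>>> melody = {"score": torch.randint(0, 128, (1, 5))}
>>> duration = {
...     "score_phn": torch.randint(1, 10, (1, 5)),
...     "score_syb": torch.randint(1, 10, (1, 5)),
... }
>>> output = model.inference(
...     text=torch.tensor([1, 2, 3, 4, 5]),
...     label=label,
...     melody=melody,
...     duration=duration,
...     noise_scale=0.667,
... )
>>> waveform = output["wav"]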
property require_raw_singing
Return whether or not raw_singing is required.
property require_vocoder
Return whether or not vocoder is required.
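A minimal sketch of how these flags are typically consulted when wiring up training or inference. Since VITS is an end-to-end text-to-wave model whose discriminator consumes raw singing, one would expect:
>>> vits_model.require_raw_singing  # discriminator needs the raw singing waveform
True
>>> vits_model.require_vocoder  # waveforms are generated directly; no external vocoder
False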