espnet2.gan_svs.vits.vits.VITS
class espnet2.gan_svs.vits.vits.VITS(idim: int, odim: int, sampling_rate: int = 22050, generator_type: str = 'visinger', vocoder_generator_type: str = 'hifigan', generator_params: Dict[str, Any] = {'decoder_channels': 512, 'decoder_kernel_size': 7, 'decoder_resblock_dilations': [[1, 3, 5], [1, 3, 5], [1, 3, 5]], 'decoder_resblock_kernel_sizes': [3, 7, 11], 'decoder_upsample_kernel_sizes': [16, 16, 4, 4], 'decoder_upsample_scales': [8, 8, 2, 2], 'expand_f0_method': 'repeat', 'flow_base_dilation': 1, 'flow_dropout_rate': 0.0, 'flow_flows': 4, 'flow_kernel_size': 5, 'flow_layers': 4, 'global_channels': -1, 'hidden_channels': 192, 'hubert_channels': 0, 'langs': None, 'posterior_encoder_base_dilation': 1, 'posterior_encoder_dropout_rate': 0.0, 'posterior_encoder_kernel_size': 5, 'posterior_encoder_layers': 16, 'posterior_encoder_stacks': 1, 'projection_filters': [0, 1, 1, 1], 'projection_kernels': [0, 5, 7, 11], 'segment_size': 32, 'spk_embed_dim': None, 'spks': None, 'text_encoder_activation_type': 'swish', 'text_encoder_attention_dropout_rate': 0.0, 'text_encoder_attention_heads': 2, 'text_encoder_blocks': 6, 'text_encoder_conformer_kernel_size': 7, 'text_encoder_dropout_rate': 0.1, 'text_encoder_ffn_expand': 4, 'text_encoder_normalize_before': True, 'text_encoder_positional_dropout_rate': 0.0, 'text_encoder_positional_encoding_layer_type': 'rel_pos', 'text_encoder_positionwise_conv_kernel_size': 1, 'text_encoder_positionwise_layer_type': 'conv1d', 'text_encoder_self_attention_layer_type': 'rel_selfattn', 'use_conformer_conv_in_text_encoder': True, 'use_macaron_style_in_text_encoder': True, 'use_only_mean_in_flow': True, 'use_phoneme_predictor': False, 'use_weight_norm_in_decoder': True, 'use_weight_norm_in_flow': True, 'use_weight_norm_in_posterior_encoder': True}, discriminator_type: str = 'hifigan_multi_scale_multi_period_discriminator', discriminator_params: Dict[str, Any] = {'avocodo': {'combd': {'combd_d_d': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1]], 'combd_d_g': [[1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1], [1, 4, 16, 64, 256, 1]], 'combd_d_k': [[7, 11, 11, 11, 11, 5], [11, 21, 21, 21, 21, 5], [15, 41, 41, 41, 41, 5]], 'combd_d_p': [[3, 5, 5, 5, 5, 2], [5, 10, 10, 10, 10, 2], [7, 20, 20, 20, 20, 2]], 'combd_d_s': [[1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1], [1, 1, 4, 4, 4, 1]], 'combd_h_u': [[16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024], [16, 64, 256, 1024, 1024, 1024]], 'combd_op_f': [1, 1, 1], 'combd_op_g': [1, 1, 1], 'combd_op_k': [3, 3, 3]}, 'pqmf_config': {'lv1': [2, 256, 0.25, 10.0], 'lv2': [4, 192, 0.13, 10.0]}, 'sbd': {'pqmf_config': {'fsbd': [64, 256, 0.1, 9.0], 'sbd': [16, 256, 0.03, 10.0]}, 'sbd_band_ranges': [[0, 6], [0, 11], [0, 16], [0, 64]], 'sbd_dilations': [[[5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11], [5, 7, 11]], [[3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7], [3, 5, 7]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]], [[1, 2, 3], [1, 2, 3], [1, 2, 3], [2, 3, 5], [2, 3, 5]]], 'sbd_filters': [[64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [64, 128, 256, 256, 256], [32, 64, 128, 128, 128]], 'sbd_kernel_sizes': [[[7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7], [7, 7, 7]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]], [[3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]], [[5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5], [5, 5, 5]]], 'sbd_strides': [[1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1], [1, 1, 3, 3, 1]], 'sbd_transpose': [False, False, False, True], 'use_sbd': True}}, 
'hifigan_multi_scale_multi_period_discriminator': {'follow_official_norm': False, 'period_discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'periods': [2, 3, 5, 7, 11], 'scale_discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [2, 2, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False, 'use_weight_norm': True}, 'scale_downsample_pooling': 'AvgPool1d', 'scale_downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'scales': 1}}, generator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, discriminator_adv_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'loss_type': 'mse'}, feat_match_loss_params: Dict[str, Any] = {'average_by_discriminators': False, 'average_by_layers': False, 'include_final_outputs': True}, mel_loss_params: Dict[str, Any] = {'fmax': None, 'fmin': 0, 'fs': 22050, 'hop_length': 256, 'log_base': None, 'n_fft': 1024, 'n_mels': 80, 'win_length': None, 'window': 'hann'}, lambda_adv: float = 1.0, lambda_mel: float = 45.0, lambda_feat_match: float = 2.0, lambda_dur: float = 0.1, lambda_kl: float = 1.0, lambda_pitch: float = 10.0, lambda_phoneme: float = 1.0, lambda_c_yin: float = 45.0, cache_generator_outputs: bool = True)
Bases: AbsGANSVS
VITS module (generator + discriminator).
This is a VITS module as described in *Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech*.
This class integrates the generator and the discriminator of the VITS model, enabling high-quality singing voice synthesis through adversarial training.
generator
The generator model for synthesizing audio.
- Type: VISingerGenerator
discriminator
The discriminator model for evaluating the generated audio.
- Type: Discriminator
lambda_adv
Coefficient for the adversarial loss.
- Type: float
lambda_mel
Coefficient for the mel spectrogram loss.
- Type: float
lambda_feat_match
Coefficient for the feature matching loss.
- Type: float
lambda_dur
Coefficient for duration loss.
- Type: float
lambda_kl
Coefficient for KL divergence loss.
- Type: float
lambda_pitch
Coefficient for pitch loss.
- Type: float
lambda_phoneme
Coefficient for phoneme loss.
- Type: float
lambda_c_yin
Coefficient for yin loss.
- Type: float
fs
Sampling rate for saving waveform during inference.
- Type: int
use_flow
Indicates whether to use flow in the generator.
- Type: bool
use_phoneme_predictor
Indicates whether to use a phoneme predictor.
- Type: bool
use_avocodo
Indicates whether to use Avocodo in the model.
- Type: bool
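For orientation, the coefficients above weight the individual terms of the total generator loss. The exact bookkeeping lives in forward(); the following is only a hedged sketch with placeholder loss variables:
loss_g = (
    lambda_adv * adv_loss
    + lambda_mel * mel_loss
    + lambda_feat_match * feat_match_loss
    + lambda_dur * dur_loss
    + lambda_kl * kl_loss
    + lambda_pitch * pitch_loss
    + lambda_phoneme * phoneme_loss
    + lambda_c_yin * yin_loss
)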
Parameters:
- idim (int) – Input vocabulary size.
- odim (int) – Acoustic feature dimension. The actual number of output channels is 1, since VITS is an end-to-end text-to-wave model, but odim is kept to indicate the acoustic feature dimension for compatibility.
- sampling_rate (int) – Sampling rate. Not used for training, but referenced when saving waveforms during inference.
- generator_type (str) – Generator type.
- vocoder_generator_type (str) – Type of vocoder generator to use in the model.
- generator_params (Dict[str, Any]) – Parameter dict for the generator.
- discriminator_type (str) – Discriminator type.
- discriminator_params (Dict[str, Any]) – Parameter dict for the discriminator.
- generator_adv_loss_params (Dict[str, Any]) – Parameter dict for the generator adversarial loss.
- discriminator_adv_loss_params (Dict[str, Any]) – Parameter dict for the discriminator adversarial loss.
- feat_match_loss_params (Dict[str, Any]) – Parameter dict for the feature matching loss.
- mel_loss_params (Dict[str, Any]) – Parameter dict for the mel loss.
- lambda_adv (float) – Loss scaling coefficient for adversarial loss.
- lambda_mel (float) – Loss scaling coefficient for mel spectrogram loss.
- lambda_feat_match (float) – Loss scaling coefficient for the feature matching loss.
- lambda_dur (float) – Loss scaling coefficient for duration loss.
- lambda_kl (float) – Loss scaling coefficient for KL divergence loss.
- lambda_pitch (float) – Loss scaling coefficient for pitch loss.
- lambda_phoneme (float) – Loss scaling coefficient for phoneme loss.
- lambda_c_yin (float) – Loss scaling coefficient for yin loss.
- cache_generator_outputs (bool) – Whether to cache generator outputs.
######### Examples
>>> # Initialize the VITS model
>>> vits_model = VITS(idim=100, odim=80)
>>> # Perform inference (text_tensor and feats_tensor are placeholders)
>>> output = vits_model.inference(text_tensor, feats_tensor)
>>> generated_waveform = output["wav"]
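The generator_params default is a large dict, and passing a partial dict replaces it wholesale at this level (the underlying generator class may still supply its own defaults). A hedged way to tweak a single entry is to copy the signature default and edit it; the hidden_channels override below is purely illustrative:
>>> import copy, inspect
>>> defaults = inspect.signature(VITS.__init__).parameters["generator_params"].default
>>> gen_params = copy.deepcopy(defaults)
>>> gen_params["hidden_channels"] = 256
>>> vits_model = VITS(idim=100, odim=80, generator_params=gen_params)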
NOTE
This implementation requires the PyTorch library and assumes that the user has a compatible environment set up for GAN training.
- Raises: ValueError – If the generator or discriminator type is not available.
forward(text: Tensor, text_lengths: Tensor, feats: Tensor, feats_lengths: Tensor, singing: Tensor, singing_lengths: Tensor, ssl_feats: Tensor | None = None, ssl_feats_lengths: Tensor | None = None, label: Dict[str, Tensor] | None = None, label_lengths: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: LongTensor | None = None, ying: Tensor | None = None, duration: Dict[str, Tensor] | None = None, slur: LongTensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, forward_generator: bool = True) → Dict[str, Any]
Perform generator forward.
This method takes the input text and related features and runs either the generator or the discriminator pass of the VITS model, depending on the forward_generator flag.
- Parameters:
- text (LongTensor) – Batch of padded character ids (B, T_text).
- text_lengths (LongTensor) – Batch of lengths of each input text (B,).
- feats (Tensor) – Batch of padded target features (B, T_feats, odim).
- feats_lengths (LongTensor) – Batch of the lengths of each target (B,).
- singing (Tensor) – Singing waveform tensor (B, T_wav).
- singing_lengths (Tensor) – Singing length tensor (B,).
- ssl_feats (Tensor) – SSL feature tensor (B, T_feats, hubert_channels).
- ssl_feats_lengths (Tensor) – SSL feature length tensor (B,).
- label (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): batch of padded label ids (B, T_text).
- label_lengths (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): batch of the lengths of padded label ids (B,).
- melody (Optional[Dict]) – Key is “lab” or “score”; value (LongTensor): batch of padded melody (B, T_text).
- pitch (FloatTensor) – Batch of padded f0 (B, T_feats).
- ying (Optional[Tensor]) – Batch of padded ying (B, T_feats).
- duration (Optional[Dict]) – Key is “lab”, “score_phn” or “score_syb”; value (LongTensor): batch of padded duration (B, T_text).
- slur (LongTensor) – Batch of padded slur (B, T_text).
- spembs (Optional[Tensor]) – Batch of speaker embeddings (B, spk_embed_dim).
- sids (Optional *[*Tensor ]) – Batch of speaker IDs (B, 1).
- lids (Optional *[*Tensor ]) – Batch of language IDs (B, 1).
- forward_generator (bool) – Whether to forward generator.
- Returns:
- loss (Tensor): Loss scalar tensor.
- stats (Dict[str, float]): Statistics to be monitored.
- weight (Tensor): Weight tensor to summarize losses.
- optim_idx (int): Optimizer index (0 for G and 1 for D).
- Return type: Dict[str, Any]
######### Examples
>>> model = VITS(...)
>>> output = model.forward(
... text=text_tensor,
... text_lengths=text_lengths_tensor,
... feats=feats_tensor,
... feats_lengths=feats_lengths_tensor,
... singing=singing_tensor,
... singing_lengths=singing_lengths_tensor,
... forward_generator=True
... )
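A hedged sketch of how the two passes combine into one GAN training iteration; the batch tensors and the two optimizers are placeholders, and in practice ESPnet's GAN trainer drives this loop rather than user code:
>>> optimizers = [gen_optimizer, disc_optimizer]  # placeholder optimizers
>>> for forward_generator in (True, False):
...     output = model.forward(
...         text=text_tensor,
...         text_lengths=text_lengths_tensor,
...         feats=feats_tensor,
...         feats_lengths=feats_lengths_tensor,
...         singing=singing_tensor,
...         singing_lengths=singing_lengths_tensor,
...         forward_generator=forward_generator,
...     )
...     optimizer = optimizers[output["optim_idx"]]  # 0 for G, 1 for D
...     optimizer.zero_grad()
...     output["loss"].backward()
...     optimizer.step()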
inference(text: Tensor, feats: Tensor | None = None, ssl_feats: Tensor | None = None, label: Dict[str, Tensor] | None = None, melody: Dict[str, Tensor] | None = None, pitch: Tensor | None = None, duration: Dict[str, Tensor] | None = None, slur: Dict[str, Tensor] | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, noise_scale: float = 0.667, noise_scale_dur: float = 0.8, alpha: float = 1.0, max_len: int | None = None, use_teacher_forcing: bool = False) → Dict[str, Tensor]
Run inference to generate a waveform from input text and features.
This method processes the input text, features, and optional parameters to produce a generated waveform using the VITS model.
- Parameters:
- text (Tensor) – Input text index tensor (T_text,).
- feats (Tensor) – Feature tensor (T_feats, aux_channels).
- ssl_feats (Tensor) – SSL Feature tensor (T_feats, hubert_channels).
- label (Optional[Dict]) – Dictionary containing label data. Keys can be “lab” or “score”; values are LongTensors representing padded label ids (B, T_text).
- melody (Optional[Dict]) – Dictionary containing melody data. Keys can be “lab” or “score”; values are LongTensors representing padded melody (B, T_text).
- pitch (FloatTensor) – Batch of padded f0 (B, T_feats).
- duration (Optional[Dict]) – Dictionary containing duration data. Keys can be “lab”, “score_phn” or “score_syb”; values are LongTensors representing padded duration (B, T_text).
- slur (LongTensor) – Batch of padded slur (B, T_text).
- spembs (Optional[Tensor]) – Speaker embedding tensor (spk_embed_dim,).
- sids (Tensor) – Speaker index tensor (1,).
- lids (Tensor) – Language index tensor (1,).
- noise_scale (float) – Noise scale value for the flow (default: 0.667).
- noise_scale_dur (float) – Noise scale value for the duration predictor (default: 0.8).
- alpha (float) – Alpha parameter to control the speed of the generated singing (default: 1.0).
- max_len (Optional[int]) – Maximum length of the output (default: None).
- use_teacher_forcing (bool) – Whether to use teacher forcing during inference (default: False).
- Returns: A dictionary containing the generated waveform tensor (T_wav,).
- Return type: Dict[str, Tensor]
######### Examples
>>> model = VITS(...)
>>> text = torch.tensor([1, 2, 3, 4]) # Example text input
>>> feats = torch.randn(10, 80) # Example features
>>> generated = model.inference(text, feats)
>>> waveform = generated['wav'] # Access the generated waveform
NOTE
Ensure that the input text and features are appropriately padded and shaped for the model to process correctly.
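For score-conditioned singing voice synthesis, label, melody, and duration are passed as dictionaries keyed as documented above. A hedged sketch; every tensor content and length below is a placeholder:
>>> import torch
>>> label = {"score": torch.randint(0, 100, (1, 5))}
>>> melody = {"score": torch.randint(0, 128, (1, 5))}
>>> duration = {
...     "score_phn": torch.randint(1, 10, (1, 5)),
...     "score_syb": torch.randint(1, 10, (1, 5)),
... }
>>> output = model.inference(
...     text=torch.tensor([1, 2, 3, 4, 5]),
...     label=label,
...     melody=melody,
...     duration=duration,
...     noise_scale=0.667,
... )
>>> waveform = output["wav"]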
property require_raw_singing
Return whether or not raw_singing is required.
property require_vocoder
Return whether or not vocoder is required.
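A minimal sketch of how these flags are typically consulted when wiring up training or inference. Since VITS is an end-to-end text-to-wave model whose discriminator consumes raw singing, one would expect:
>>> vits_model.require_raw_singing  # discriminator needs the raw singing waveform
True
>>> vits_model.require_vocoder  # waveforms are generated directly; no external vocoder
False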