espnet2.tts2.espnet_model.ESPnetTTS2Model
class espnet2.tts2.espnet_model.ESPnetTTS2Model(discrete_feats_extract: AbsFeatsExtractDiscrete, pitch_extract: AbsFeatsExtract | None, energy_extract: AbsFeatsExtract | None, pitch_normalize: InversibleInterface | None, energy_normalize: InversibleInterface | None, tts: AbsTTS2)
Bases: AbsESPnetModel
ESPnet model for the text-to-speech task.
This class implements a text-to-speech (TTS) model within the ESPnet framework. It combines discrete speech feature extraction, optional pitch and energy extraction and normalization, and a module that synthesizes speech from text inputs.
discrete_feats_extract
Feature extractor for discrete speech features.
- Type: AbsFeatsExtractDiscrete
pitch_extract
Feature extractor for pitch.
- Type: Optional[AbsFeatsExtract]
energy_extract
Feature extractor for energy.
- Type: Optional[AbsFeatsExtract]
pitch_normalize
Normalizer for pitch features.
- Type: Optional[AbsNormalize and InversibleInterface]
energy_normalize
Normalizer for energy features.
- Type: Optional[AbsNormalize and InversibleInterface]
tts
Text-to-speech synthesis module.
- Type: AbsTTS2
Parameters:
- discrete_feats_extract (AbsFeatsExtractDiscrete) – Feature extractor for discrete speech.
- pitch_extract (Optional[AbsFeatsExtract]) – Feature extractor for pitch.
- energy_extract (Optional[AbsFeatsExtract]) – Feature extractor for energy.
- pitch_normalize (Optional[AbsNormalize and InversibleInterface]) – Normalizer for pitch.
- energy_normalize (Optional[AbsNormalize and InversibleInterface]) – Normalizer for energy.
- tts (AbsTTS2) – TTS synthesis module.
######### Examples
Creating an ESPnetTTS2Model instance (the extractors, normalizers, and TTS module are assumed to be built beforehand):

>>> model = ESPnetTTS2Model(discrete_feats_extract, pitch_extract,
...                         energy_extract, pitch_normalize,
...                         energy_normalize, tts)

Computing the training loss with the forward method:

>>> loss, stats, weight = model.forward(
...     text_tensor, text_lengths_tensor,
...     discrete_speech_tensor, discrete_speech_lengths_tensor,
...     speech_tensor, speech_lengths_tensor)

Generating outputs with the inference method:

>>> output = model.inference(text_tensor, speech=speech_tensor)
####### NOTE The feature extractors, normalizers, and TTS module must be instantiated separately and passed to the constructor. Ensure that all dependencies are satisfied.
Initialize ESPnetTTS2Model module.
collect_feats(text: Tensor, text_lengths: Tensor, discrete_speech: Tensor, discrete_speech_lengths: Tensor, speech: Tensor, speech_lengths: Tensor, durations: Tensor | None = None, durations_lengths: Tensor | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, energy: Tensor | None = None, energy_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, **kwargs) → Dict[str, Tensor]
Calculate features and return them as a dict.
- Parameters:
- text (Tensor) – Text index tensor (B, T_text).
- text_lengths (Tensor) – Text length tensor (B,).
- speech (Tensor) – Speech waveform tensor (B, T_wav).
- speech_lengths (Tensor) – Speech length tensor (B,).
- discrete_speech (Tensor) – Discrete speech tensor (B, T_token).
- discrete_speech_lengths (Tensor) – Discrete speech length tensor (B,).
- durations (Optional[Tensor]) – Duration tensor.
- durations_lengths (Optional[Tensor]) – Duration length tensor (B,).
- pitch (Optional[Tensor]) – Pitch tensor.
- pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).
- energy (Optional[Tensor]) – Energy tensor.
- energy_lengths (Optional[Tensor]) – Energy length tensor (B,).
- spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).
- sids (Optional[Tensor]) – Speaker ID tensor (B, 1).
- lids (Optional[Tensor]) – Language ID tensor (B, 1).
- Returns: Dict of features.
- Return type: Dict[str, Tensor]
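######### Examples

A minimal usage sketch with dummy batch data (the shapes below are illustrative; pitch and energy entries appear in the returned dict only when the corresponding extractors are configured):

>>> import torch
>>> text = torch.randint(0, 100, (2, 10))
>>> text_lengths = torch.tensor([10, 9])
>>> discrete_speech = torch.randint(0, 50, (2, 20))
>>> discrete_speech_lengths = torch.tensor([20, 18])
>>> speech = torch.randn(2, 16000)
>>> speech_lengths = torch.tensor([16000, 14000])
>>> feats = model.collect_feats(text, text_lengths, discrete_speech,
...                             discrete_speech_lengths, speech,
...                             speech_lengths)
>>> isinstance(feats, dict)
True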
forward(text: Tensor, text_lengths: Tensor, discrete_speech: Tensor, discrete_speech_lengths: Tensor, speech: Tensor, speech_lengths: Tensor, durations: Tensor | None = None, durations_lengths: Tensor | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, energy: Tensor | None = None, energy_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, **kwargs) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate outputs and return the loss tensor.
This method processes input tensors related to text and speech and computes the necessary outputs, including loss and statistics for monitoring during training. It handles both auxiliary features such as pitch and energy, as well as discrete features extracted from the speech waveform.
- Parameters:
- text (torch.Tensor) – Text index tensor (B, T_text).
- text_lengths (torch.Tensor) – Text length tensor (B,).
- discrete_speech (torch.Tensor) – Discrete speech tensor (B, T_token).
- discrete_speech_lengths (torch.Tensor) – Discrete speech length tensor (B,).
- speech (torch.Tensor) – Speech waveform tensor (B, T_wav).
- speech_lengths (torch.Tensor) – Speech length tensor (B,).
- durations (Optional[torch.Tensor]) – Duration tensor (B,).
- durations_lengths (Optional[torch.Tensor]) – Duration length tensor (B,).
- pitch (Optional[torch.Tensor]) – Pitch tensor (B, T_pitch).
- pitch_lengths (Optional[torch.Tensor]) – Pitch length tensor (B,).
- energy (Optional[torch.Tensor]) – Energy tensor (B, T_energy).
- energy_lengths (Optional[torch.Tensor]) – Energy length tensor (B,).
- spembs (Optional[torch.Tensor]) – Speaker embedding tensor (B, D).
- sids (Optional[torch.Tensor]) – Speaker ID tensor (B, 1).
- lids (Optional[torch.Tensor]) – Language ID tensor (B, 1).
- kwargs – Additional arguments, including “utt_id”.
- Returns:
- Loss scalar tensor.
- A dictionary containing statistics to be monitored.
- Weight tensor to summarize losses.
- Return type: Tuple[torch.Tensor, Dict[str, torch.Tensor], torch.Tensor]
######### Examples
>>> import torch
>>> model = ESPnetTTS2Model(...)
>>> text = torch.randint(0, 100, (2, 10))
>>> text_lengths = torch.tensor([10, 9])
>>> discrete_speech = torch.randint(0, 50, (2, 20))
>>> discrete_speech_lengths = torch.tensor([20, 18])
>>> speech = torch.randn(2, 16000)
>>> speech_lengths = torch.tensor([16000, 14000])
>>> loss, stats, weight = model.forward(text, text_lengths, discrete_speech,
...                                     discrete_speech_lengths, speech,
...                                     speech_lengths)
####### NOTE Ensure that all input tensors are correctly shaped and contain valid data types as expected by the model.
inference(text: Tensor, speech: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, durations: Tensor | None = None, pitch: Tensor | None = None, energy: Tensor | None = None, **decode_config) → Dict[str, Tensor]
Calculate outputs and return them as a dict.
- Parameters:
- text (Tensor) – Text index tensor (T_text).
- speech (Optional[Tensor]) – Speech waveform tensor (T_wav).
- spembs (Optional[Tensor]) – Speaker embedding tensor (D,).
- sids (Optional[Tensor]) – Speaker ID tensor (1,).
- lids (Optional[Tensor]) – Language ID tensor (1,).
- durations (Optional[Tensor]) – Duration tensor.
- pitch (Optional[Tensor]) – Pitch tensor.
- energy (Optional[Tensor]) – Energy tensor.
- Returns: Dict of outputs.
- Return type: Dict[str, Tensor]
######### Examples
>>> import torch
>>> model = ESPnetTTS2Model(...)
>>> text = torch.tensor([...])  # example text index tensor (T_text,)
>>> output = model.inference(text)
>>> print(output.keys())
dict_keys(['feat_gen', 'other_output_keys'])
####### NOTE Ensure that the input tensors are properly shaped and normalized as required by the model.