espnet2.tts.espnet_model.ESPnetTTSModel
class espnet2.tts.espnet_model.ESPnetTTSModel(feats_extract: AbsFeatsExtract | None, pitch_extract: AbsFeatsExtract | None, energy_extract: AbsFeatsExtract | None, normalize: InversibleInterface | None, pitch_normalize: InversibleInterface | None, energy_normalize: InversibleInterface | None, tts: AbsTTS)
Bases: AbsESPnetModel
ESPnet model for text-to-speech task.
This class implements a text-to-speech (TTS) model using the ESPnet framework. It provides methods for forward propagation, feature extraction, and inference to generate speech from text input.
feats_extract
Feature extraction module for audio.
- Type: Optional[AbsFeatsExtract]
pitch_extract
Feature extraction module for pitch.
- Type: Optional[AbsFeatsExtract]
energy_extract
Feature extraction module for energy.
- Type: Optional[AbsFeatsExtract]
normalize
Normalization module for audio features.
- Type: Optional[AbsNormalize and InversibleInterface]
pitch_normalize
Normalization module for pitch features.
- Type: Optional[AbsNormalize and InversibleInterface]
energy_normalize
Normalization module for energy features.
- Type: Optional[AbsNormalize and InversibleInterface]
tts
Main TTS module that generates speech from features.
- Type: AbsTTS
Parameters:
- feats_extract (Optional[AbsFeatsExtract]) – Feature extraction module for audio.
- pitch_extract (Optional[AbsFeatsExtract]) – Feature extraction module for pitch.
- energy_extract (Optional[AbsFeatsExtract]) – Feature extraction module for energy.
- normalize (Optional[AbsNormalize and InversibleInterface]) – Normalization module for audio features.
- pitch_normalize (Optional[AbsNormalize and InversibleInterface]) – Normalization module for pitch features.
- energy_normalize (Optional[AbsNormalize and InversibleInterface]) – Normalization module for energy features.
- tts (AbsTTS) – Main TTS module that generates speech from features.
Returns: None; the constructor does not return a value.
Return type: None
######### Examples
Initialize the ESPnet TTS model:

>>> model = ESPnetTTSModel(
...     feats_extract=my_feats_extract,
...     pitch_extract=my_pitch_extract,
...     energy_extract=my_energy_extract,
...     normalize=my_normalize,
...     pitch_normalize=my_pitch_normalize,
...     energy_normalize=my_energy_normalize,
...     tts=my_tts,
... )

Forward pass through the model:

>>> loss, stats, weight = model.forward(text, text_lengths, speech,
...                                     speech_lengths)

Feature extraction:

>>> feats_dict = model.collect_feats(text, text_lengths, speech,
...                                  speech_lengths)

Inference:

>>> output_dict = model.inference(text, speech=speech)
- Raises: RuntimeError – If required arguments are missing during inference.
Initialize ESPnetTTSModel module.
collect_feats(text: Tensor, text_lengths: Tensor, speech: Tensor, speech_lengths: Tensor, durations: Tensor | None = None, durations_lengths: Tensor | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, energy: Tensor | None = None, energy_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, **kwargs) → Dict[str, Tensor]
Calculate features and return them as a dict.
- Parameters:
- text (Tensor) – Text index tensor (B, T_text).
- text_lengths (Tensor) – Text length tensor (B,).
- speech (Tensor) – Speech waveform tensor (B, T_wav).
- speech_lengths (Tensor) – Speech length tensor (B,).
- durations (Optional[Tensor]) – Duration tensor.
- durations_lengths (Optional[Tensor]) – Duration length tensor (B,).
- pitch (Optional[Tensor]) – Pitch tensor.
- pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).
- energy (Optional[Tensor]) – Energy tensor.
- energy_lengths (Optional[Tensor]) – Energy length tensor (B,).
- spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).
- sids (Optional[Tensor]) – Speaker ID tensor (B, 1).
- lids (Optional[Tensor]) – Language ID tensor (B, 1).
- Returns: Dict of features.
- Return type: Dict[str, Tensor]
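######### Examples
A minimal usage sketch (illustrative shapes): the returned dict typically contains "feats" and "feats_lengths", plus "pitch"/"energy" entries when those features are given.
>>> feats_dict = model.collect_feats(text, text_lengths, speech,
...                                  speech_lengths)
>>> feats_dict["feats"].shape  # e.g. (B, T_feats, n_mels) for a log-mel extractor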
forward(text: Tensor, text_lengths: Tensor, speech: Tensor, speech_lengths: Tensor, durations: Tensor | None = None, durations_lengths: Tensor | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, energy: Tensor | None = None, energy_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, **kwargs) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate outputs and return the loss tensor.
This method processes input tensors representing text and speech, extracts necessary features, normalizes them if required, and computes the loss alongside any relevant statistics.
- Parameters:
- text (Tensor) – Text index tensor (B, T_text).
- text_lengths (Tensor) – Text length tensor (B,).
- speech (Tensor) – Speech waveform tensor (B, T_wav).
- speech_lengths (Tensor) – Speech length tensor (B,).
- durations (Optional[Tensor]) – Duration tensor (B,).
- durations_lengths (Optional[Tensor]) – Duration length tensor (B,).
- pitch (Optional[Tensor]) – Pitch tensor (B,).
- pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).
- energy (Optional[Tensor]) – Energy tensor (B,).
- energy_lengths (Optional[Tensor]) – Energy length tensor (B,).
- spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).
- sids (Optional[Tensor]) – Speaker ID tensor (B, 1).
- lids (Optional[Tensor]) – Language ID tensor (B, 1).
- kwargs – Additional arguments; “utt_id” is among the inputs.
- Returns: Tuple containing:
    - Tensor: Loss scalar tensor.
    - Dict[str, float]: Statistics to be monitored.
    - Tensor: Weight tensor to summarize losses.
- Return type: Tuple[Tensor, Dict[str, Tensor], Tensor]
######### Examples
>>> model = ESPnetTTSModel(...)
>>> text = torch.tensor([[1, 2, 3], [4, 5, 6]])
>>> text_lengths = torch.tensor([3, 3])
>>> speech = torch.randn(2, 16000) # Example speech data
>>> speech_lengths = torch.tensor([16000, 16000])
>>> loss, stats, weights = model.forward(text, text_lengths, speech,
... speech_lengths)
inference(text: Tensor, speech: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, durations: Tensor | None = None, pitch: Tensor | None = None, energy: Tensor | None = None, **decode_config) → Dict[str, Tensor]
Calculate features and return them as a dict.
This method performs inference using the text input and optionally uses other auxiliary features such as speech waveform, speaker embeddings, speaker IDs, language IDs, durations, pitch, and energy. It prepares the input for the text-to-speech model and returns the output dictionary containing generated features.
- Parameters:
- text (Tensor) – Text index tensor (T_text).
- speech (Optional[Tensor]) – Speech waveform tensor (T_wav).
- spembs (Optional[Tensor]) – Speaker embedding tensor (D,).
- sids (Optional[Tensor]) – Speaker ID tensor (1,).
- lids (Optional[Tensor]) – Language ID tensor (1,).
- durations (Optional[Tensor]) – Duration tensor.
- pitch (Optional[Tensor]) – Pitch tensor.
- energy (Optional[Tensor]) – Energy tensor.
- **decode_config – Additional decoding configuration options.
- Returns: Dictionary of outputs, which may include generated features such as “feat_gen” and any other relevant outputs from the TTS model.
- Return type: Dict[str, Tensor]
- Raises: RuntimeError – If speech is required but not provided when using teacher forcing.
######### Examples
>>> model = ESPnetTTSModel(...)
>>> text_tensor = torch.tensor([1, 2, 3])  # 1-D tensor of shape (T_text,)
>>> output = model.inference(text_tensor)
>>> print(output.keys())
dict_keys(['feat_gen', ...])
NOTE
If normalization is applied to the features, the inverse normalization is also performed on the generated features before returning them.
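The denormalization step can be pictured as follows. This is a simplified sketch with assumed names, not the verbatim ESPnet implementation; it assumes normalize implements InversibleInterface (e.g., GlobalMVN):
```python
import torch

# Simplified sketch (assumed helper name): undo feature normalization on the
# generated features before inference() returns them.
def denormalize_feat_gen(output_dict, normalize):
    if normalize is not None and "feat_gen" in output_dict:
        feat_gen = output_dict["feat_gen"].unsqueeze(0)  # add batch dim: (1, T, D)
        feat_len = torch.tensor([feat_gen.size(1)])
        # InversibleInterface.inverse returns (tensor, lengths).
        feat_gen, _ = normalize.inverse(feat_gen, feat_len)
        output_dict["feat_gen"] = feat_gen.squeeze(0)  # back to (T, D)
    return output_dict
```
Because of this step, the returned "feat_gen" is in the original (pre-normalization) feature domain and can be passed directly to a vocoder.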