espnet2.tts.espnet_model.ESPnetTTSModel
class espnet2.tts.espnet_model.ESPnetTTSModel(feats_extract: AbsFeatsExtract | None, pitch_extract: AbsFeatsExtract | None, energy_extract: AbsFeatsExtract | None, normalize: InversibleInterface | None, pitch_normalize: InversibleInterface | None, energy_normalize: InversibleInterface | None, tts: AbsTTS)
Bases: AbsESPnetModel
ESPnet model for text-to-speech task.
This class implements a text-to-speech (TTS) model using the ESPnet framework. It provides methods for forward propagation, feature extraction, and inference to generate speech from text input.
feats_extract
Feature extraction module for audio.
- Type: Optional[AbsFeatsExtract]
pitch_extract
Feature extraction module for pitch.
- Type: Optional[AbsFeatsExtract]
energy_extract
Feature extraction module for energy.
- Type: Optional[AbsFeatsExtract]
normalize
Normalization module for audio features.
- Type: Optional[AbsNormalize and InversibleInterface]
pitch_normalize
Normalization module for pitch features.
- Type: Optional[AbsNormalize and InversibleInterface]
energy_normalize
Normalization module for energy features.
- Type: Optional[AbsNormalize and InversibleInterface]
tts
Main TTS module that generates speech from features.
- Type: AbsTTS
Parameters:
- feats_extract (Optional[AbsFeatsExtract]) – Feature extraction module for audio.
- pitch_extract (Optional[AbsFeatsExtract]) – Feature extraction module for pitch.
- energy_extract (Optional[AbsFeatsExtract]) – Feature extraction module for energy.
- normalize (Optional[AbsNormalize and InversibleInterface]) – Normalization module for audio features.
- pitch_normalize (Optional[AbsNormalize and InversibleInterface]) – Normalization module for pitch features.
- energy_normalize (Optional[AbsNormalize and InversibleInterface]) – Normalization module for energy features.
- tts (AbsTTS) – Main TTS module that generates speech from features.
Returns: None; the constructor does not return a value.
Return type: None
######### Examples
Initialize the ESPnet TTS model:

>>> model = ESPnetTTSModel(
...     feats_extract=my_feats_extract,
...     pitch_extract=my_pitch_extract,
...     energy_extract=my_energy_extract,
...     normalize=my_normalize,
...     pitch_normalize=my_pitch_normalize,
...     energy_normalize=my_energy_normalize,
...     tts=my_tts,
... )

Forward pass through the model:

>>> loss, stats, weight = model.forward(text, text_lengths, speech,
...                                     speech_lengths)

Feature extraction:

>>> feats_dict = model.collect_feats(text, text_lengths, speech,
...                                  speech_lengths)

Inference:

>>> output_dict = model.inference(text, speech=speech)
- Raises: RuntimeError – If required arguments are missing during inference.
Initialize ESPnetTTSModel module.
collect_feats(text: Tensor, text_lengths: Tensor, speech: Tensor, speech_lengths: Tensor, durations: Tensor | None = None, durations_lengths: Tensor | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, energy: Tensor | None = None, energy_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, **kwargs) → Dict[str, Tensor]
Calculate features and return them as a dict.
- Parameters:
- text (Tensor) – Text index tensor (B, T_text).
- text_lengths (Tensor) – Text length tensor (B,).
- speech (Tensor) – Speech waveform tensor (B, T_wav).
- speech_lengths (Tensor) – Speech length tensor (B,).
- durations (Optional[Tensor]) – Duration tensor.
- durations_lengths (Optional[Tensor]) – Duration length tensor (B,).
- pitch (Optional[Tensor]) – Pitch tensor.
- pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).
- energy (Optional[Tensor]) – Energy tensor.
- energy_lengths (Optional[Tensor]) – Energy length tensor (B,).
- spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).
- sids (Optional[Tensor]) – Speaker ID tensor (B, 1).
- lids (Optional[Tensor]) – Language ID tensor (B, 1).
- Returns: Dict of features.
- Return type: Dict[str, Tensor]
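######### Examples
A minimal usage sketch (illustrative shapes): the returned dict typically contains "feats" and "feats_lengths", plus "pitch"/"energy" entries when those features are given.
>>> feats_dict = model.collect_feats(text, text_lengths, speech,
...                                  speech_lengths)
>>> feats_dict["feats"].shape  # e.g. (B, T_feats, n_mels) for a log-mel extractor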
forward(text: Tensor, text_lengths: Tensor, speech: Tensor, speech_lengths: Tensor, durations: Tensor | None = None, durations_lengths: Tensor | None = None, pitch: Tensor | None = None, pitch_lengths: Tensor | None = None, energy: Tensor | None = None, energy_lengths: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, **kwargs) → Tuple[Tensor, Dict[str, Tensor], Tensor]
Calculate outputs and return the loss tensor.
This method processes input tensors representing text and speech, extracts necessary features, normalizes them if required, and computes the loss alongside any relevant statistics.
- Parameters:
- text (Tensor) – Text index tensor (B, T_text).
- text_lengths (Tensor) – Text length tensor (B,).
- speech (Tensor) – Speech waveform tensor (B, T_wav).
- speech_lengths (Tensor) – Speech length tensor (B,).
- durations (Optional[Tensor]) – Duration tensor (B,).
- durations_lengths (Optional[Tensor]) – Duration length tensor (B,).
- pitch (Optional[Tensor]) – Pitch tensor (B,).
- pitch_lengths (Optional[Tensor]) – Pitch length tensor (B,).
- energy (Optional[Tensor]) – Energy tensor (B,).
- energy_lengths (Optional[Tensor]) – Energy length tensor (B,).
- spembs (Optional[Tensor]) – Speaker embedding tensor (B, D).
- sids (Optional[Tensor]) – Speaker ID tensor (B, 1).
- lids (Optional[Tensor]) – Language ID tensor (B, 1).
- kwargs – Additional arguments; “utt_id” is among the inputs.
- Returns: Tuple containing:
    - Tensor: Loss scalar tensor.
    - Dict[str, float]: Statistics to be monitored.
    - Tensor: Weight tensor to summarize losses.
- Return type: Tuple[Tensor, Dict[str, Tensor], Tensor]
######### Examples
>>> model = ESPnetTTSModel(...)
>>> text = torch.tensor([[1, 2, 3], [4, 5, 6]])
>>> text_lengths = torch.tensor([3, 3])
>>> speech = torch.randn(2, 16000) # Example speech data
>>> speech_lengths = torch.tensor([16000, 16000])
>>> loss, stats, weights = model.forward(text, text_lengths, speech,
... speech_lengths)
inference(text: Tensor, speech: Tensor | None = None, spembs: Tensor | None = None, sids: Tensor | None = None, lids: Tensor | None = None, durations: Tensor | None = None, pitch: Tensor | None = None, energy: Tensor | None = None, **decode_config) → Dict[str, Tensor]
Calculate features and return them as a dict.
This method performs inference using the text input and optionally uses other auxiliary features such as speech waveform, speaker embeddings, speaker IDs, language IDs, durations, pitch, and energy. It prepares the input for the text-to-speech model and returns the output dictionary containing generated features.
- Parameters:
- text (Tensor) – Text index tensor (T_text).
- speech (Optional[Tensor]) – Speech waveform tensor (T_wav).
- spembs (Optional[Tensor]) – Speaker embedding tensor (D,).
- sids (Optional[Tensor]) – Speaker ID tensor (1,).
- lids (Optional[Tensor]) – Language ID tensor (1,).
- durations (Optional[Tensor]) – Duration tensor.
- pitch (Optional[Tensor]) – Pitch tensor.
- energy (Optional[Tensor]) – Energy tensor.
- **decode_config – Additional decoding configuration options.
- Returns: Dictionary of outputs, which may include generated features such as “feat_gen” and any other relevant outputs from the TTS model.
- Return type: Dict[str, Tensor]
- Raises: RuntimeError – If speech is required but not provided when using teacher forcing.
######### Examples
>>> model = ESPnetTTSModel(...)
>>> text_tensor = torch.tensor([1, 2, 3])  # 1-D tensor of shape (T_text,)
>>> output = model.inference(text_tensor)
>>> print(output.keys())
dict_keys(['feat_gen', ...])
NOTE
If normalization is applied to the features, the inverse normalization is also performed on the generated features before returning them.
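The denormalization step can be pictured as follows. This is a simplified sketch with assumed names, not the verbatim ESPnet implementation; it assumes normalize implements InversibleInterface (e.g., GlobalMVN):
```python
import torch

# Simplified sketch (assumed helper name): undo feature normalization on the
# generated features before inference() returns them.
def denormalize_feat_gen(output_dict, normalize):
    if normalize is not None and "feat_gen" in output_dict:
        feat_gen = output_dict["feat_gen"].unsqueeze(0)  # add batch dim: (1, T, D)
        feat_len = torch.tensor([feat_gen.size(1)])
        # InversibleInterface.inverse returns (tensor, lengths).
        feat_gen, _ = normalize.inverse(feat_gen, feat_len)
        output_dict["feat_gen"] = feat_gen.squeeze(0)  # back to (T, D)
    return output_dict
```
Because of this step, the returned "feat_gen" is in the original (pre-normalization) feature domain and can be passed directly to a vocoder.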