espnet2.tasks.tts.TTSTask
espnet2.tasks.tts.TTSTask
class espnet2.tasks.tts.TTSTask
Bases: AbsTask
TTSTask is a class that implements a text-to-speech (TTS) task for training
and evaluating TTS models.
This class provides methods to handle TTS-specific functionalities such as building models, processing data, and defining the required and optional data names for training and inference.
num_optimizers
The number of optimizers to be used in the task.
- Type: int
class_choices_list
A list of choices for various components like feature extraction, normalization, and TTS models.
- Type: list
trainer
The trainer class used for training the TTS model.
Type:Trainer
Parameters:parser (argparse.ArgumentParser) – The argument parser to add task related arguments.
Returns: A callable function that collates the input data.
Return type: Callable
Raises:RuntimeError – If the token list is not of the expected type.
################### Examples
To add task-specific arguments
TTSTask.add_task_arguments(parser)
To build a model
model = TTSTask.build_model(args)
To get required data names for training
required_names = TTSTask.required_data_names(train=True)
To build vocoder from a file
vocoder = TTSTask.build_vocoder_from_file(vocoder_config_file=’path/to/config.yaml’)
######### NOTE This class inherits from AbsTask and requires the implementation of abstract methods defined in the parent class.
classmethod add_task_arguments(parser: ArgumentParser)
Adds task-related arguments to the argument parser for the TTS task.
This method configures the argument parser with the necessary arguments for the text-to-speech task. It defines both task-specific and preprocessing arguments, which are essential for training and evaluating TTS models.
- Parameters:
- cls – The class itself, typically used in class methods.
- parser (argparse.ArgumentParser) – The argument parser to which the arguments will be added.
################### Examples
parser = argparse.ArgumentParser() TTSTask.add_task_arguments(parser)
This will add the following arguments to the parser:
–token_list
–odim
–model_conf
–use_preprocessor
–token_type
–bpemodel
–non_linguistic_symbols
–cleaner
–g2p
and others defined in class_choices_list.
######### NOTE The arguments for feature extraction, normalization, and TTS models are added from the class_choices_list.
- Raises:None – This method does not raise any exceptions.
classmethod build_collate_fn(args: Namespace, train: bool) → Callable[[Collection[Tuple[str, Dict[str, ndarray]]]], Tuple[List[str], Dict[str, Tensor]]]
Build a collate function for batching data.
This method constructs a collate function that is used to process batches of data. The collate function will handle padding and formatting of input data into a form suitable for training or evaluation.
- Parameters:
- args (argparse.Namespace) – The arguments parsed from the command line.
- train (bool) – Indicates whether the function is being used for training or evaluation.
- Returns: A collate function that takes a collection of data samples and returns a tuple containing a list of keys and a dictionary of tensors.
- Return type: Callable
################### Examples
>>> collate_fn = TTSTask.build_collate_fn(args, train=True)
>>> batch = collate_fn([("sample1", {"feature": np.array([1, 2])}),
... ("sample2", {"feature": np.array([3, 4])})])
>>> print(batch)
(['sample1', 'sample2'], {'feature': tensor([[1, 2], [3, 4]])})
######### NOTE This function uses CommonCollateFn to handle padding and data organization.
classmethod build_model(args: Namespace) → ESPnetTTSModel
Builds and returns an instance of the ESPnetTTSModel based on the provided
arguments.
This method configures various components of the text-to-speech (TTS) model, including feature extraction, normalization, and the TTS architecture itself. It also handles the token list for input processing.
Parameters:args (argparse.Namespace) –
A namespace object containing the necessary configurations for building the model, including:
- token_list (str or list): Path to a file or a list of tokens mapping
int-id to token.
- odim (int, optional): The number of dimensions of the output feature.
- feats_extract (str): Type of feature extractor to use.
- feats_extract_conf (dict): Configuration for the feature extractor.
- normalize (str, optional): Type of normalization to apply.
- normalize_conf (dict, optional): Configuration for normalization.
- tts (str): Type of TTS model to use.
- tts_conf (dict): Configuration for the TTS model.
- pitch_extract (str, optional): Type of pitch extractor to use.
- pitch_extract_conf (dict, optional): Configuration for pitch extractor.
- energy_extract (str, optional): Type of energy extractor to use.
- energy_extract_conf (dict, optional): Configuration for energy extractor.
- pitch_normalize (str, optional): Type of pitch normalization to apply.
- pitch_normalize_conf (dict, optional): Configuration for pitch normalization.
- energy_normalize (str, optional): Type of energy normalization to apply.
- energy_normalize_conf (dict, optional): Configuration for energy normalization.
- model_conf (dict, optional): Additional keyword arguments for the model class.
Returns: An instance of the configured ESPnetTTSModel.
Return type:ESPnetTTSModel
Raises:RuntimeError – If token_list is not a string or a list.
################### Examples
Building a model with a specified token list file
args = argparse.Namespace(
token_list=’path/to/token_list.txt’, odim=None, feats_extract=’fbank’, feats_extract_conf={‘param1’: value1}, normalize=’global_mvn’, tts=’tacotron2’, tts_conf={‘param2’: value2},
) model = TTSTask.build_model(args)
Building a model with a predefined list of tokens
args = argparse.Namespace(
token_list=[‘a’, ‘b’, ‘c’], odim=80, feats_extract=’spectrogram’, feats_extract_conf={‘param1’: value1}, normalize=None, tts=’fastspeech2’, tts_conf={‘param2’: value2},
) model = TTSTask.build_model(args)
classmethod build_preprocess_fn(args: Namespace, train: bool) → Callable[[str, Dict[str, array]], Dict[str, ndarray]] | None
Builds a preprocessing function based on the given arguments.
This function returns a callable that applies preprocessing to input text and its corresponding feature dictionary if preprocessing is enabled. If preprocessing is not required, it returns None.
- Parameters:
- cls – The class type for method resolution.
- args (argparse.Namespace) – The parsed arguments containing preprocessing options.
- train (bool) – Indicates whether the function is for training or inference.
- Returns: A function that takes a string and a dictionary of features, applying the specified preprocessing steps, or None if preprocessing is disabled.
- Return type: Optional[Callable[[str, Dict[str, np.array]], Dict[str, np.ndarray]]]
################### Examples
>>> from argparse import Namespace
>>> args = Namespace(use_preprocessor=True, token_type='phn', ...)
>>> preprocess_fn = TTSTask.build_preprocess_fn(args, train=True)
>>> features = preprocess_fn("sample text", {"feature_key": np.array([1, 2])})
######### NOTE The preprocessing function may include tokenization, text cleaning, and conversion to phonemes based on the specified arguments.
classmethod build_vocoder_from_file(vocoder_config_file: Path | str | None = None, vocoder_file: Path | str | None = None, model: ESPnetTTSModel | None = None, device: str = 'cpu')
Builds a vocoder from the specified configuration file or file path.
This method constructs a vocoder based on the provided vocoder configuration file and vocoder model file. If no vocoder model file is provided, it defaults to using the Griffin-Lim vocoder. If the vocoder model file has a “.pkl” extension, it is assumed to be a trained Parallel WaveGAN model.
vocoder_config_file
Path to the vocoder configuration file.
- Type: Union[Path, str]
vocoder_file
Path to the vocoder model file.
- Type: Union[Path, str]
model
The TTS model used for extracting features.
- Type: Optional[ESPnetTTSModel]
device
The device to load the model onto (default: “cpu”).
Type: str
Parameters:
- vocoder_config_file (Union *[*Path , str ]) – The path to the vocoder config file.
- vocoder_file (Union *[*Path , str ]) – The path to the vocoder model file.
- model (Optional [ESPnetTTSModel ]) – The TTS model to extract features from.
- device (str) – The device to load the model onto (default: “cpu”).
Returns: The constructed vocoder or None if the vocoder could not be built.
Return type: Optional[Union[Spectrogram2Waveform, ParallelWaveGANPretrainedVocoder]]
Raises:ValueError – If the vocoder_file format is not supported.
################### Examples
Build vocoder using a configuration file and model file
vocoder = TTSTask.build_vocoder_from_file(
vocoder_config_file=”path/to/vocoder_config.yaml”, vocoder_file=”path/to/vocoder_model.pkl”, model=my_tts_model, device=”cuda”
)
Build vocoder using only a configuration file, defaults to Griffin-Lim
vocoder = TTSTask.build_vocoder_from_file(
vocoder_config_file=”path/to/vocoder_config.yaml”, model=my_tts_model
)
class_choices_list
num_optimizers
classmethod optional_data_names(train: bool = True, inference: bool = False) → Tuple[str, ...]
Returns a tuple of optional data names used in the text-to-speech task.
The optional data names vary based on whether the task is in training or inference mode. During training, the following optional data names can be utilized: ‘spembs’, ‘durations’, ‘pitch’, ‘energy’, ‘sids’, ‘lids’. In inference mode, ‘spembs’, ‘speech’, ‘durations’, ‘pitch’, ‘energy’, ‘sids’, and ‘lids’ can be used.
- Parameters:
- train (bool) – Indicates if the task is in training mode. Defaults to True.
- inference (bool) – Indicates if the task is in inference mode. Defaults to False.
- Returns: A tuple containing the names of optional data.
- Return type: Tuple[str, …]
################### Examples
>>> TTSTask.optional_data_names(train=True, inference=False)
('spembs', 'durations', 'pitch', 'energy', 'sids', 'lids')
>>> TTSTask.optional_data_names(train=True, inference=True)
('spembs', 'speech', 'durations', 'pitch', 'energy', 'sids', 'lids')
classmethod required_data_names(train: bool = True, inference: bool = False) → Tuple[str, ...]
Text-to-speech task.
This class provides functionalities to manage the text-to-speech (TTS) task, including argument parsing, data handling, and model building.
num_optimizers
The number of optimizers used for training.
- Type: int
class_choices_list
A list of available class choices for features extraction, normalization, TTS, pitch extraction, and energy extraction.
Type: List[ClassChoices]
Parameters:
- train (bool) – Indicates whether the task is in training mode. Defaults to True.
- inference (bool) – Indicates whether the task is in inference mode. Defaults to False.
Returns: A tuple of required data names based on the mode.
Return type: Tuple[str, …]
################### Examples
>>> TTSTask.required_data_names(train=True, inference=False)
('text', 'speech')
>>> TTSTask.required_data_names(train=True, inference=True)
('text',)
######### NOTE In inference mode, only ‘text’ is required.
trainer
alias of Trainer