espnet2.train.deepspeed_trainer.DeepSpeedTrainer
class espnet2.train.deepspeed_trainer.DeepSpeedTrainer
Bases: Trainer
DeepSpeed Trainer Module for training deep learning models using DeepSpeed.
This class extends the Trainer class and integrates with the DeepSpeed library to facilitate efficient training of models. It manages the training loop, validation, checkpointing, and resuming from checkpoints.
- Parameters:
- model (Union[AbsESPnetModel, DeepSpeedEngine]) – The model to be trained.
- train_iter_factory (AbsIterFactory) – Factory to create training iterators.
- valid_iter_factory (AbsIterFactory) – Factory to create validation iterators.
- trainer_options (DeepSpeedTrainerOptions) – Options for the DeepSpeed trainer.
- **kwargs – Additional arguments.
- Returns: None
- Raises: ImportError – If the DeepSpeed library is not installed.
################# Examples
Example usage of DeepSpeedTrainer
DeepSpeedTrainer.run(model, train_iter_factory, valid_iter_factory, trainer_options)
######## NOTE Ensure that the DeepSpeed library is installed in your environment.
classmethod build_options(args: Namespace) → DeepSpeedTrainerOptions
Build options for the DeepSpeedTrainer from command-line arguments.
This method constructs a DeepSpeedTrainerOptions instance, which contains various configuration settings necessary for training using the DeepSpeed library. It utilizes the provided command-line arguments parsed into an argparse.Namespace object.
- Parameters:
- cls – The class that calls this method (typically the DeepSpeedTrainer class).
- args (argparse.Namespace) – The command-line arguments parsed into a Namespace object.
- Returns: A DeepSpeedTrainerOptions instance containing the configuration options.
- Return type: DeepSpeedTrainerOptions
################# Examples
>>> import argparse
>>> parser = argparse.ArgumentParser()
>>> parser.add_argument('--resume', action='store_true')  # type=bool would treat any non-empty string, even "False", as True
>>> parser.add_argument('--seed', type=int, default=42)
>>> parser.add_argument('--train_dtype', type=str, default='float32')
>>> parser.add_argument('--log_interval', type=int, default=10)
>>> parser.add_argument('--output_dir', type=str, default='./output')
>>> parser.add_argument('--max_epoch', type=int, default=100)
>>> parser.add_argument('--deepspeed_config', type=str, default='ds_config.json')
>>> args = parser.parse_args([])  # empty list: use the defaults declared above
>>> options = DeepSpeedTrainer.build_options(args)
>>> print(options)
DeepSpeedTrainerOptions(resume=False, seed=42, train_dtype='float32',
log_interval=10, output_dir=Path('./output'),
max_epoch=100, deepspeed_config=Path('ds_config.json'))
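The conversion from a parsed Namespace into an options object can be pictured with the minimal sketch below. `SketchOptions` and `options_from_namespace` are hypothetical names introduced only for illustration; the fields mirror the argument names used in the example above. This is not ESPnet's implementation, only one plausible way such a conversion could work.

```python
import argparse
from dataclasses import dataclass, fields
from pathlib import Path
from typing import Optional, Union


@dataclass
class SketchOptions:
    """Hypothetical stand-in for DeepSpeedTrainerOptions (illustration only)."""

    resume: bool
    seed: int
    train_dtype: str
    log_interval: Optional[int]
    output_dir: Union[Path, str]
    max_epoch: int
    deepspeed_config: Union[Path, str]


def options_from_namespace(args: argparse.Namespace) -> SketchOptions:
    """Copy every dataclass field from the namespace, failing on missing ones."""
    values = {}
    for f in fields(SketchOptions):
        if not hasattr(args, f.name):
            raise ValueError(f"args has no attribute named {f.name!r}")
        values[f.name] = getattr(args, f.name)
    return SketchOptions(**values)


# Build a namespace directly instead of parsing the real command line.
args = argparse.Namespace(
    resume=False,
    seed=42,
    train_dtype="float32",
    log_interval=10,
    output_dir="./output",
    max_epoch=100,
    deepspeed_config="ds_config.json",
)
options = options_from_namespace(args)
```

Failing early on a missing attribute makes a misconfigured argument parser visible at start-up rather than partway through training.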
DeepSpeed Trainer Module
This module provides the DeepSpeedTrainer class, which facilitates training using the DeepSpeed library. It includes functionality for building options, resuming training from checkpoints, and running training and validation epochs.
resume
Flag to indicate whether to resume training from a checkpoint.
- Type: bool
seed
Seed for random number generation.
- Type: int
train_dtype
Data type for training (e.g., float32).
- Type: Union[str, torch.dtype]
log_interval
Interval for logging training metrics.
- Type: Optional[int]
output_dir
Directory to save output and checkpoints.
- Type: Union[Path, str]
max_epoch
Maximum number of epochs for training.
- Type: int
deepspeed_config
Path to the DeepSpeed configuration file.
- Type: Union[Path, str]
Parameters:
- model (DeepSpeedEngine) – The DeepSpeed model to be trained.
- reporter (Reporter) – The reporter instance for logging metrics.
- output_dir (Path) – Directory containing checkpoints to resume from.
Returns: This method does not return any value.
Return type: None
Raises: ImportError – If the DeepSpeed library is not installed.
################# Examples
To resume training from the latest checkpoint: trainer.resume(model, reporter, output_dir)
To build options from command-line arguments: options = DeepSpeedTrainer.build_options(args)
classmethod run(model: AbsESPnetModel | None, train_iter_factory: AbsIterFactory, valid_iter_factory: AbsIterFactory, trainer_options: DeepSpeedTrainerOptions, **kwargs) → None
Run the training and validation process for the DeepSpeedTrainer.
This method initializes the DeepSpeed engine, sets up the reporter, and orchestrates the training and validation loops for the specified number of epochs. It also handles checkpointing and resuming training if required.
- Parameters:
- model (Union[AbsESPnetModel, DeepSpeedEngine]) – The model to be trained.
- train_iter_factory (AbsIterFactory) – Factory for creating training data iterators.
- valid_iter_factory (AbsIterFactory) – Factory for creating validation data iterators.
- trainer_options (DeepSpeedTrainerOptions) – Options containing training configurations such as max epochs, seed, etc.
- **kwargs – Additional keyword arguments (not used).
- Raises: ImportError – If the DeepSpeed package is not installed.
################# Examples
>>> from espnet2.train.deepspeed_trainer import DeepSpeedTrainer, DeepSpeedTrainerOptions
>>> options = DeepSpeedTrainerOptions(
... resume=False,
... seed=42,
... train_dtype='fp16',
... log_interval=100,
... output_dir='output',
... max_epoch=10,
... deepspeed_config='ds_config.json'
... )
>>> # run is a classmethod, so it is called on the class itself
>>> DeepSpeedTrainer.run(model, train_iter_factory, valid_iter_factory, options)
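The orchestration described for run (train, validate, then checkpoint, repeated up to max_epoch) can be sketched as a plain loop. Everything here is hypothetical: `run_epochs` and its callback parameters are illustrative names, and the assumption that epochs are numbered from 1 and that a resumed run starts at the epoch after the restored one is not stated in this page.

```python
from typing import Callable, List


def run_epochs(
    max_epoch: int,
    start_epoch: int,
    train_one_epoch: Callable[[int], None],
    valid_one_epoch: Callable[[int], None],
    save_checkpoint: Callable[[int], None],
) -> None:
    """Run train -> validate -> checkpoint for each remaining epoch."""
    for epoch in range(start_epoch, max_epoch + 1):
        train_one_epoch(epoch)
        valid_one_epoch(epoch)
        save_checkpoint(epoch)


# Record the call order to show what a resumed run would execute.
log: List[str] = []
run_epochs(
    max_epoch=3,
    start_epoch=2,  # assumed: epoch 1 was already restored from a checkpoint
    train_one_epoch=lambda e: log.append(f"train {e}"),
    valid_one_epoch=lambda e: log.append(f"valid {e}"),
    save_checkpoint=lambda e: log.append(f"save {e}"),
)
print(log)
```

Saving after validation means each checkpoint corresponds to an epoch whose validation statistics have already been reported.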
classmethod setup_data_dtype(deepspeed_config: Dict)
Sets up the data type for training based on the DeepSpeed configuration.
This method determines the appropriate data type (dtype) for training based on the provided DeepSpeed configuration. It checks for specific keys in the configuration dictionary to decide between bfloat16, float16, or float32.
- Parameters:
- cls – The class reference.
- deepspeed_config (Dict) – A dictionary containing DeepSpeed configuration options, which can include "bf16", "fp16", or "amp".
- Returns: The data type to be used for training.
- Return type: torch.dtype
################# Examples
>>> deepspeed_config = {"bf16": True}
>>> dtype = DeepSpeedTrainer.setup_data_dtype(deepspeed_config)
>>> print(dtype)
torch.bfloat16
>>> deepspeed_config = {"fp16": True}
>>> dtype = DeepSpeedTrainer.setup_data_dtype(deepspeed_config)
>>> print(dtype)
torch.float16
>>> deepspeed_config = {}
>>> dtype = DeepSpeedTrainer.setup_data_dtype(deepspeed_config)
>>> print(dtype)
torch.float32
######## NOTE The method checks for the presence of “bf16”, “fp16”, and “amp” keys in the configuration. The choice of dtype may depend on the capabilities of the underlying hardware.
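A presence-based key check of the kind the note describes might look like the sketch below. `pick_dtype_name` is a hypothetical helper that returns dtype names as strings so that it runs without PyTorch. The order of the checks and the mapping of "amp" to float16 are assumptions; this page does not specify either.

```python
from typing import Any, Dict


def pick_dtype_name(deepspeed_config: Dict[str, Any]) -> str:
    """Return a dtype name based on which precision key is present."""
    if "bf16" in deepspeed_config:
        return "bfloat16"
    if "fp16" in deepspeed_config:
        return "float16"
    if "amp" in deepspeed_config:
        # Assumption: the documentation does not say what "amp" maps to.
        return "float16"
    # No precision key present: fall back to full precision.
    return "float32"


print(pick_dtype_name({"bf16": {"enabled": True}}))
print(pick_dtype_name({"fp16": {"enabled": True}}))
print(pick_dtype_name({}))
```

Because only the presence of a key is tested in this sketch, a section such as `{"bf16": {"enabled": False}}` would still select bfloat16. If the real method behaves the same way, removing the key entirely is the safer way to disable a precision mode.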
classmethod train_one_epoch(model, iterator: Iterable[Tuple[List[str], Dict[str, Tensor]]], reporter: SubReporter, options: DeepSpeedTrainerOptions) → None
Train the model for one epoch using the provided data iterator.
This method handles the training loop for a single epoch, performing forward and backward passes through the model, logging statistics, and updating model parameters. It utilizes distributed training techniques to ensure synchronization across multiple devices.
model
The model to be trained.
- Type: DeepSpeedEngine
iterator
An iterable that provides batches of training data.
- Type: Iterable[Tuple[List[str], Dict[str, torch.Tensor]]]
reporter
An object for logging and reporting training metrics.
- Type: SubReporter
options
Options that configure the training process.
- Type: DeepSpeedTrainerOptions
Parameters:
- cls – The class reference.
- model – The model to train, expected to be a DeepSpeedEngine instance.
- iterator – An iterable that yields tuples containing utterance IDs and batches of data.
- reporter – An instance of SubReporter for logging purposes.
- options – A DeepSpeedTrainerOptions instance containing training configuration options.
Returns: This method does not return any value.
Return type: None
Raises: AssertionError – If the batch is not a dictionary.
################# Examples
>>> DeepSpeedTrainer.train_one_epoch(model, data_iterator, reporter, options)
######## NOTE This method is designed to work in a distributed training setup where multiple processes may be running concurrently. It ensures that all processes synchronize at certain points to maintain consistency in training.
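The per-batch work described above (forward pass, backward pass, parameter update, and the dictionary check on each batch) can be sketched with a stub in place of a real engine. `StubEngine` and `train_steps` are hypothetical names. The stub only mimics the call shape: a callable engine plus `backward(loss)` and `step()`, which are the method names a DeepSpeed engine exposes. Returning a `(loss, stats, weight)` triple from the forward call is an assumption about the model's convention.

```python
from typing import Dict, Iterable, List, Tuple


class StubEngine:
    """Hypothetical stand-in exposing the three calls the loop relies on."""

    def __init__(self) -> None:
        self.steps = 0
        self.pending = None

    def __call__(self, **batch: float) -> Tuple[float, Dict[str, float], float]:
        # Toy "forward pass": the loss is just the sum of the inputs.
        loss = sum(batch.values())
        return loss, {"loss": loss}, 1.0

    def backward(self, loss: float) -> None:
        # A real engine would run backpropagation here.
        self.pending = loss

    def step(self) -> None:
        # A real engine would apply the optimizer update here.
        self.steps += 1
        self.pending = None


def train_steps(
    engine, iterator: Iterable[Tuple[List[str], Dict[str, float]]]
) -> List[float]:
    """Run forward, backward, and step once per batch; return the losses."""
    losses = []
    for utt_ids, batch in iterator:
        assert isinstance(batch, dict), type(batch)
        loss, stats, weight = engine(**batch)
        engine.backward(loss)
        engine.step()
        losses.append(loss)
    return losses


engine = StubEngine()
data = [(["utt1"], {"x": 1.0, "y": 2.0}), (["utt2"], {"x": 0.5, "y": 0.5})]
losses = train_steps(engine, data)
print(losses, engine.steps)
```

The sketch leaves out statistics reporting and cross-rank synchronization, both of which the real method is documented to handle.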
classmethod valid_one_epoch(model, iterator: Iterable[Tuple[List[str], Dict[str, Tensor]]], reporter: SubReporter, options: DeepSpeedTrainerOptions) → None
Validates the model for one epoch.
This method evaluates the model’s performance on the validation dataset for one epoch. It computes the loss and statistics while ensuring that all distributed ranks are synchronized during the validation process.
- Parameters:
- model (Union[AbsESPnetModel, DeepSpeedEngine]) – The model to be validated.
- iterator (Iterable[Tuple[List[str], Dict[str, torch.Tensor]]]) – An iterator that provides batches of validation data, where each batch is a tuple containing utterance IDs and a dictionary of tensors.
- reporter (SubReporter) – An object responsible for reporting metrics and statistics during the validation process.
- options (DeepSpeedTrainerOptions) – Options for the DeepSpeed trainer, including data types and configuration settings.
################# Examples
>>> from my_package import MyModel, MyDataLoader
>>> model = MyModel()
>>> valid_iterator = MyDataLoader()
>>> reporter = SubReporter()
>>> options = DeepSpeedTrainerOptions(...)
>>> DeepSpeedTrainer.valid_one_epoch(model, valid_iterator, reporter, options)
######## NOTE This method is designed to work in a distributed environment where synchronization between ranks is necessary. It will stop processing if any rank has completed its validation.
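The stopping rule in the note can be illustrated with a single-process simulation. `simulate_synced_validation` is a hypothetical function: each inner sequence plays the role of one rank's data, and summing the per-rank flags stands in for the collective reduction a real distributed run would perform. It is a sketch of the idea, not the trainer's code.

```python
from typing import List, Sequence


def simulate_synced_validation(
    batches_per_rank: Sequence[Sequence[str]],
) -> List[List[str]]:
    """Process batches in lockstep; stop every rank once any rank runs out."""
    iterators = [iter(b) for b in batches_per_rank]
    processed: List[List[str]] = [[] for _ in batches_per_rank]
    while True:
        # Each rank tries to fetch its next batch and raises a flag if it cannot.
        fetched = [next(it, None) for it in iterators]
        flags = [1 if item is None else 0 for item in fetched]
        # Stand-in for an all-reduce: every rank sees the combined flag.
        if sum(flags) > 0:
            break
        for rank, item in enumerate(fetched):
            processed[rank].append(item)
    return processed


# Rank 0 has three batches, rank 1 has only two.
result = simulate_synced_validation([["a0", "a1", "a2"], ["b0", "b1"]])
print(result)
```

In this simulation rank 0 never processes its third batch, because rank 1 has nothing left at that step. Stopping all ranks together keeps any rank from waiting on a collective operation that a finished rank would never join.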