espnet2.train.deepspeed_trainer.DeepSpeedTrainer

class espnet2.train.deepspeed_trainer.DeepSpeedTrainer

Bases: Trainer

DeepSpeed Trainer Module for training deep learning models using DeepSpeed.

This class extends the Trainer class and integrates with the DeepSpeed library to facilitate efficient training of models. It manages the training loop, validation, checkpointing, and resuming from checkpoints.

Parameters:
- model (Union[AbsESPnetModel, DeepSpeedEngine]) – The model to be trained.
- train_iter_factory (AbsIterFactory) – Factory to create training iterators.
- valid_iter_factory (AbsIterFactory) – Factory to create validation iterators.
- trainer_options (DeepSpeedTrainerOptions) – Options for the DeepSpeed trainer.
- **kwargs – Additional arguments.
Returns: None
Raises: ImportError – If the DeepSpeed library is not installed.

################# Examples

Example usage of DeepSpeedTrainer (run is a classmethod, so no instance is needed):

>>> DeepSpeedTrainer.run(model, train_iter_factory, valid_iter_factory, trainer_options)

######## NOTE Ensure that the DeepSpeed library is installed in your environment.
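
The ImportError mentioned above comes from a missing optional dependency. A minimal sketch of checking for the package before starting a run, using only the standard library; the helper name `is_installed` is hypothetical and not part of ESPnet:

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if the top-level `package` can be found in this environment."""
    # find_spec returns None when a top-level module cannot be located
    return importlib.util.find_spec(package) is not None


if not is_installed("deepspeed"):
    print("DeepSpeed is not installed; DeepSpeedTrainer will raise ImportError.")
```

Checking up front lets a launch script fail early with a clear message instead of partway through setup.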

classmethod build_options(args: Namespace) → DeepSpeedTrainerOptions

Build options for the DeepSpeedTrainer from command-line arguments.

This method constructs a DeepSpeedTrainerOptions instance, which contains various configuration settings necessary for training using the DeepSpeed library. It utilizes the provided command-line arguments parsed into an argparse.Namespace object.

Parameters:
- cls – The class that calls this method (typically the DeepSpeedTrainer class).
- args (argparse.Namespace) – The command-line arguments parsed into a Namespace object.
Returns: A DeepSpeedTrainerOptions instance containing the configuration options.
Return type: DeepSpeedTrainerOptions

################# Examples

>>> import argparse
>>> from espnet2.train.deepspeed_trainer import DeepSpeedTrainer
>>> parser = argparse.ArgumentParser()
>>> # type=bool would turn any non-empty string (even "False") into True
>>> parser.add_argument('--resume', action='store_true')
>>> parser.add_argument('--seed', type=int, default=42)
>>> parser.add_argument('--train_dtype', type=str, default='float32')
>>> parser.add_argument('--log_interval', type=int, default=10)
>>> parser.add_argument('--output_dir', type=str, default='./output')
>>> parser.add_argument('--max_epoch', type=int, default=100)
>>> parser.add_argument('--deepspeed_config', type=str, default='ds_config.json')
>>> args = parser.parse_args([])  # empty list: use the defaults above
>>> options = DeepSpeedTrainer.build_options(args)
>>> print(options)
DeepSpeedTrainerOptions(resume=False, seed=42, train_dtype='float32',
                        log_interval=10, output_dir=Path('./output'),
                        max_epoch=100, deepspeed_config=Path('ds_config.json'))
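
The mapping from a parsed Namespace to an options object can be sketched with the standard library alone. This is a minimal illustration, assuming the options object is a dataclass whose fields are copied by name from the namespace; `ExampleOptions` and `build_from_namespace` are hypothetical stand-ins, not ESPnet's implementation:

```python
import argparse
import dataclasses
from typing import Optional


@dataclasses.dataclass
class ExampleOptions:  # hypothetical stand-in for DeepSpeedTrainerOptions
    resume: bool
    seed: int
    log_interval: Optional[int]
    max_epoch: int


def build_from_namespace(cls, args: argparse.Namespace):
    """Fill a dataclass from namespace attributes that share its field names."""
    kwargs = {}
    for field in dataclasses.fields(cls):
        if not hasattr(args, field.name):
            raise ValueError(f"args is missing the field '{field.name}'")
        kwargs[field.name] = getattr(args, field.name)
    return cls(**kwargs)


# The namespace may carry extra attributes (here: lr); only declared fields are copied.
args = argparse.Namespace(resume=False, seed=42, log_interval=10, max_epoch=100, lr=0.1)
options = build_from_namespace(ExampleOptions, args)
print(options)
```

Copying by field name keeps the trainer options decoupled from the full set of command-line flags: adding an unrelated flag does not require touching the options class.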

static resume(model: None, reporter: Reporter, output_dir: Path)

Resume training from a checkpoint.

This method loads the latest checkpoint found in output_dir into the DeepSpeed model so that training continues from where it stopped. The attributes listed below are the fields of DeepSpeedTrainerOptions, which configure the trainer as a whole (building options, resuming from checkpoints, and running training and validation epochs).

resume

Flag to indicate whether to resume training from a checkpoint.

Type: bool

seed

Seed for random number generation.

Type: int

train_dtype

Data type for training (e.g., float32).

Type: Union[str, torch.dtype]

log_interval

Interval for logging training metrics.

Type: Optional[int]

output_dir

Directory to save output and checkpoints.

Type: Union[Path, str]

max_epoch

Maximum number of epochs for training.

Type: int

deepspeed_config

Path to the DeepSpeed configuration file.

Type: Union[Path, str]

Parameters:
- model (DeepSpeedEngine) – The DeepSpeed model to be trained.
- reporter (Reporter) – The reporter instance for logging metrics.
- output_dir (Path) – Directory containing checkpoints to resume from.
Returns: This method does not return any value.
Return type: None
Raises: ImportError – If the DeepSpeed library is not installed.

################# Examples

To resume training from the latest checkpoint:

>>> DeepSpeedTrainer.resume(model, reporter, output_dir)

To build options from command-line arguments:

>>> options = DeepSpeedTrainer.build_options(args)
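
Resuming "from the latest checkpoint" implies choosing one checkpoint among several in output_dir. The sketch below shows one way to do that, assuming checkpoints are stored as directories named `checkpoint_<epoch>`; that naming scheme and the helper `find_latest_checkpoint` are assumptions for illustration, not taken from the ESPnet source:

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_latest_checkpoint(output_dir: Path, prefix: str = "checkpoint_") -> Optional[Path]:
    """Return the checkpoint directory with the highest numeric suffix, if any."""
    candidates = []
    for item in output_dir.iterdir():
        suffix = item.name[len(prefix):]
        if item.is_dir() and item.name.startswith(prefix) and suffix.isdigit():
            candidates.append((int(suffix), item))
    if not candidates:
        return None  # nothing to resume from
    # Compare numerically so that epoch 10 ranks above epoch 2
    return max(candidates, key=lambda pair: pair[0])[1]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    empty_result = find_latest_checkpoint(root)
    for epoch in (1, 2, 10):
        (root / f"checkpoint_{epoch}").mkdir()
    latest_name = find_latest_checkpoint(root).name

print(latest_name)
```

A DeepSpeedEngine exposes a `load_checkpoint` method that can then restore model and optimizer state from the chosen directory.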

classmethod run(model: AbsESPnetModel | None, train_iter_factory: AbsIterFactory, valid_iter_factory: AbsIterFactory, trainer_options: DeepSpeedTrainerOptions, **kwargs) → None

Run the training and validation process for the DeepSpeedTrainer.

This method initializes the DeepSpeed engine, sets up the reporter, and orchestrates the training and validation loops for the specified number of epochs. It also handles checkpointing and resuming training if required.

Parameters:
- model (Union[AbsESPnetModel, DeepSpeedEngine]) – The model to be trained.
- train_iter_factory (AbsIterFactory) – Factory for creating training data iterators.
- valid_iter_factory (AbsIterFactory) – Factory for creating validation data iterators.
- trainer_options (DeepSpeedTrainerOptions) – Options containing training configurations such as max epochs, seed, etc.
- **kwargs – Additional keyword arguments (not used).
Raises: ImportError – If the DeepSpeed package is not installed.

################# Examples

>>> from espnet2.train.deepspeed_trainer import (
...     DeepSpeedTrainer,
...     DeepSpeedTrainerOptions,
... )
>>> options = DeepSpeedTrainerOptions(
...     resume=False,
...     seed=42,
...     train_dtype='fp16',
...     log_interval=100,
...     output_dir='output',
...     max_epoch=10,
...     deepspeed_config='ds_config.json'
... )
>>> # run is a classmethod; model and the iterator factories are built elsewhere
>>> DeepSpeedTrainer.run(model, train_iter_factory, valid_iter_factory, options)
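
The example above points deepspeed_config at a `ds_config.json` file. A minimal sketch of producing such a file from Python follows; the keys are standard DeepSpeed configuration keys, but the values are illustrative assumptions and should be tuned for the model and hardware in use:

```python
import json
import tempfile
from pathlib import Path

# Illustrative values only; none of them come from the ESPnet documentation.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "ds_config.json"
    config_path.write_text(json.dumps(ds_config, indent=2))
    # Read the file back, as a trainer would before initializing DeepSpeed
    loaded = json.loads(config_path.read_text())

print(sorted(loaded))
```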

classmethod setup_data_dtype(deepspeed_config: Dict)

Sets up the data type for training based on the DeepSpeed configuration.

This method determines the appropriate data type (dtype) for training based on the provided DeepSpeed configuration. It checks for specific keys in the configuration dictionary to decide between bfloat16, float16, or float32.

Parameters:
- cls – The class reference.
- deepspeed_config (Dict) – A dictionary containing DeepSpeed configuration options, which can include "bf16", "fp16", or "amp".
Returns: The data type to be used for training.
Return type: torch.dtype

################# Examples

>>> deepspeed_config = {"bf16": True}
>>> dtype = DeepSpeedTrainer.setup_data_dtype(deepspeed_config)
>>> print(dtype)
torch.bfloat16

>>> deepspeed_config = {"fp16": True}
>>> dtype = DeepSpeedTrainer.setup_data_dtype(deepspeed_config)
>>> print(dtype)
torch.float16

>>> deepspeed_config = {}
>>> dtype = DeepSpeedTrainer.setup_data_dtype(deepspeed_config)
>>> print(dtype)
torch.float32

######## NOTE The method checks for the presence of “bf16”, “fp16”, and “amp” keys in the configuration. The choice of dtype may depend on the capabilities of the underlying hardware.
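
The key-based selection described above can be sketched without PyTorch by returning dtype names as strings. This is an illustration only: the precedence order (bf16 before fp16) and the handling of "amp" are assumptions, since the documentation does not state them, and `pick_train_dtype` is a hypothetical helper:

```python
from typing import Any, Dict


def pick_train_dtype(deepspeed_config: Dict[str, Any], amp_dtype: str = "float16") -> str:
    """Return the name of the training dtype implied by the config keys."""
    if "bf16" in deepspeed_config:
        return "bfloat16"
    if "fp16" in deepspeed_config:
        return "float16"
    if "amp" in deepspeed_config:
        # Assumption: the docs do not say which dtype "amp" selects,
        # so the caller supplies it here.
        return amp_dtype
    return "float32"


print(pick_train_dtype({"bf16": {"enabled": True}}))
print(pick_train_dtype({"fp16": {"enabled": True}}))
print(pick_train_dtype({}))
```

Note that this sketch, like the examples above, tests only for the presence of a key, so a section such as `{"bf16": {"enabled": False}}` would still select bfloat16.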

classmethod train_one_epoch(model, iterator: Iterable[Tuple[List[str], Dict[str, Tensor]]], reporter: SubReporter, options: DeepSpeedTrainerOptions) → None

Train the model for one epoch using the provided data iterator.

This method handles the training loop for a single epoch, performing forward and backward passes through the model, logging statistics, and updating model parameters. It utilizes distributed training techniques to ensure synchronization across multiple devices.

model

The model to be trained.

Type: DeepSpeedEngine

iterator

An iterable that provides batches of training data.

Type: Iterable[Tuple[List[str], Dict[str, torch.Tensor]]]

reporter

An object for logging and reporting training metrics.

Type: SubReporter

options

Options that configure the training process.

Type: DeepSpeedTrainerOptions

Parameters:
- cls – The class reference.
- model – The model to train, expected to be a DeepSpeedEngine instance.
- iterator – An iterable that yields tuples containing utterance IDs and batches of data.
- reporter – An instance of SubReporter for logging purposes.
- options – A DeepSpeedTrainerOptions instance containing training configuration options.
Returns: This method does not return any value.
Return type: None
Raises: AssertionError – If the batch is not a dictionary.

################# Examples

>>> DeepSpeedTrainer.train_one_epoch(model, data_iterator, reporter, options)

######## NOTE This method is designed to work in a distributed training setup where multiple processes may be running concurrently. It ensures that all processes synchronize at certain points to maintain consistency in training.
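
The forward, backward, and update steps described above can be sketched as a plain loop. With DeepSpeed, the engine itself is called for the forward pass and exposes `backward(loss)` and `step()`. The `FakeEngine` below is a hypothetical stand-in so the sketch runs without DeepSpeed, and it returns a bare number as the loss, which simplifies what a real ESPnet model returns:

```python
from typing import Dict, Iterable, List, Tuple


class FakeEngine:
    """Hypothetical stand-in exposing the engine calls used in the loop."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def __call__(self, **batch: float) -> float:
        self.calls.append("forward")
        return sum(batch.values())  # pretend the loss is the sum of the inputs

    def backward(self, loss: float) -> None:
        self.calls.append("backward")

    def step(self) -> None:
        self.calls.append("step")


def train_one_epoch_sketch(
    engine, iterator: Iterable[Tuple[List[str], Dict[str, float]]]
) -> List[float]:
    losses = []
    for utt_ids, batch in iterator:
        # Mirrors the documented AssertionError for non-dict batches
        assert isinstance(batch, dict), type(batch)
        loss = engine(**batch)  # forward pass
        engine.backward(loss)   # backward pass managed by the engine
        engine.step()           # parameter update
        losses.append(loss)
    return losses


engine = FakeEngine()
data = [(["utt1"], {"speech": 1.0, "text": 2.0}), (["utt2"], {"speech": 0.5, "text": 0.5})]
losses = train_one_epoch_sketch(engine, data)
print(losses)
```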

classmethod valid_one_epoch(model, iterator: Iterable[Tuple[List[str], Dict[str, Tensor]]], reporter: SubReporter, options: DeepSpeedTrainerOptions) → None

Validates the model for one epoch.

This method evaluates the model’s performance on the validation dataset for one epoch. It computes the loss and statistics while ensuring that all distributed ranks are synchronized during the validation process.

Parameters:
- model (Union[AbsESPnetModel, DeepSpeedEngine]) – The model to be validated.
- iterator (Iterable[Tuple[List[str], Dict[str, torch.Tensor]]]) – An iterator that provides batches of validation data, where each batch is a tuple containing utterance IDs and a dictionary of tensors.
- reporter (SubReporter) – An object responsible for reporting metrics and statistics during the validation process.
- options (DeepSpeedTrainerOptions) – Options for the DeepSpeed trainer, including data types and configuration settings.
Returns: This method does not return any value.
Return type: None

################# Examples

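>>> from espnet2.train.deepspeed_trainer import DeepSpeedTrainer, DeepSpeedTrainerOptions
>>> from espnet2.train.reporter import Reporter
>>> from my_package import MyModel, MyDataLoader  # hypothetical user code
>>> model = MyModel()
>>> valid_iterator = MyDataLoader()
>>> options = DeepSpeedTrainerOptions(...)
>>> reporter = Reporter()
>>> reporter.set_epoch(1)
>>> # A SubReporter is obtained from Reporter.observe(), not constructed directly
>>> with reporter.observe("valid") as sub_reporter:
...     DeepSpeedTrainer.valid_one_epoch(model, valid_iterator, sub_reporter, options)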

######## NOTE This method is designed to work in a distributed environment where synchronization between ranks is necessary. It will stop processing if any rank has completed its validation.
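
The stopping rule in the note above can be illustrated with a single-process simulation. Each simulated rank raises a flag when its iterator is exhausted, and a sum over the flags stands in for the all-reduce that a real distributed run would perform (for example with `torch.distributed.all_reduce`). The function `lockstep_batches` is hypothetical and assumes batches are never None:

```python
from typing import List, Sequence


def lockstep_batches(per_rank_batches: Sequence[Sequence[str]]) -> List[List[str]]:
    """Simulate ranks that all stop once any rank has no batch left."""
    iterators = [iter(batches) for batches in per_rank_batches]
    processed: List[List[str]] = [[] for _ in per_rank_batches]
    while True:
        # Each rank tries to fetch its next batch; None marks exhaustion.
        next_items = [next(it, None) for it in iterators]
        stop_flags = [1 if item is None else 0 for item in next_items]
        # Stand-in for a sum all-reduce: every rank sees the same total.
        if sum(stop_flags) > 0:
            break
        for rank, item in enumerate(next_items):
            processed[rank].append(item)
    return processed


result = lockstep_batches([["a0", "a1", "a2"], ["b0", "b1"]])
print(result)
```

Rank 0 never processes `a2` here: stopping together avoids a rank waiting forever on a collective operation that the finished rank will never join.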