espnet2.train.distributed_utils.DistributedOption
class espnet2.train.distributed_utils.DistributedOption(distributed: bool = False, dist_backend: str = 'nccl', dist_init_method: str = 'env://', dist_world_size: int | None = None, dist_rank: int | None = None, local_rank: int | None = None, ngpu: int = 0, dist_master_addr: str | None = None, dist_master_port: int | None = None, dist_launcher: str | None = None, multiprocessing_distributed: bool = True)
Bases: object
Dataclass to manage distributed training options in PyTorch.
distributed
Flag to enable distributed training. Default is False.
- Type: bool
dist_backend
The backend to use for distributed training. Options include “nccl”, “mpi”, “gloo”, or “tcp”. Default is “nccl”.
- Type: str
dist_init_method
Method for initializing the process group. If “env://”, it uses environment variables for configuration. Default is “env://”.
- Type: str
dist_world_size
Total number of processes participating in the job. Default is None.
- Type: Optional[int]
dist_rank
Rank of the current process. Default is None.
- Type: Optional[int]
local_rank
Rank of the current process on the node. Default is None.
- Type: Optional[int]
ngpu
Number of GPUs available for training. Default is 0.
- Type: int
dist_master_addr
Address of the master process. Default is None.
- Type: Optional[str]
dist_master_port
Port of the master process. Default is None.
- Type: Optional[int]
dist_launcher
The launcher used to start the distributed job. Default is None.
- Type: Optional[str]
multiprocessing_distributed
Flag indicating if the training is using multiprocessing. Default is True.
- Type: bool
init_options()
Initializes distributed training options based on the specified attributes.
init_torch_distributed()
Initializes the PyTorch distributed process group.
init_deepspeed()
Initializes DeepSpeed distributed training.
- Raises:
- RuntimeError – If required environment variables are not set or if inconsistencies in ranks or world size are detected.
- ValueError – If trying to initialize DeepSpeed without initializing PyTorch distributed first.
Examples
>>> options = DistributedOption(distributed=True, ngpu=2)
>>> options.init_options()
>>> options.init_torch_distributed()
>>> options.init_deepspeed()
NOTE: This class is designed to be used in distributed training scenarios, particularly with PyTorch and DeepSpeed.
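The following is a minimal, illustrative sketch (not part of the ESPnet API) of how a torchrun-style launcher environment feeds the "env://" initialization path; the values are placeholders for a two-process, single-node job.
>>> import os
>>> from espnet2.train.distributed_utils import DistributedOption
>>> os.environ["MASTER_ADDR"] = "127.0.0.1"  # address of the rank-0 process
>>> os.environ["MASTER_PORT"] = "29500"      # free TCP port on the rank-0 host
>>> os.environ["WORLD_SIZE"] = "2"           # total number of processes
>>> os.environ["RANK"] = "0"                 # global rank of this process
>>> os.environ["LOCAL_RANK"] = "0"           # rank within this node
>>> options = DistributedOption(distributed=True, dist_backend="nccl", ngpu=1)
>>> options.init_options()            # fills rank/world_size/local_rank from the env
>>> options.init_torch_distributed()  # joins the PyTorch process group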
init_deepspeed()
Initialize DeepSpeed for distributed training.
This method sets up DeepSpeed for distributed training by first ensuring that PyTorch’s distributed backend is initialized. It checks that the necessary environment variables are set and raises appropriate errors if they are not. The method also logs a warning if the OMP_NUM_THREADS environment variable is set to 1, suggesting that this may not be sufficient for optimal performance with DeepSpeed.
- Raises:
- ImportError – If the DeepSpeed package cannot be imported.
- ValueError – If PyTorch distributed is not initialized before initializing DeepSpeed.
Examples
>>> distributed_options = DistributedOption(distributed=True)
>>> distributed_options.init_options()
>>> distributed_options.init_torch_distributed()
>>> distributed_options.init_deepspeed()
NOTE: Ensure that the environment variables for distributed training, such as RANK, WORLD_SIZE, and LOCAL_RANK, are properly set before calling this method.
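A defensive calling order, sketched under the assumption that a launcher has already exported the variables named in the NOTE above; the explicit check is illustrative and not part of this class.
>>> import os
>>> from espnet2.train.distributed_utils import DistributedOption
>>> missing = [v for v in ("RANK", "WORLD_SIZE", "LOCAL_RANK") if v not in os.environ]
>>> if missing:
...     raise RuntimeError(f"export {missing} before initializing DeepSpeed")
>>> options = DistributedOption(distributed=True, ngpu=1)
>>> options.init_options()
>>> options.init_torch_distributed()  # must come first; otherwise ValueError is raised
>>> options.init_deepspeed()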
init_options()
Initialize the options for distributed training.
This method configures the distributed training settings based on the specified attributes. It verifies that the necessary environment variables are set and assigns values to the dist_rank, dist_world_size, and local_rank attributes. It also checks for potential issues such as exceeding the number of visible devices.
If the dist_init_method is set to “env://”, the method will attempt to retrieve the master address and port from the environment variables or use the specified values. If both the master address and port are provided, it will set the dist_init_method to a TCP URL format.
- Raises:
- RuntimeError – If required environment variables or attributes are not set correctly or if the rank exceeds the world size.
Examples
Example 1: Using default environment variables
>>> options = DistributedOption(distributed=True)
>>> options.init_options()
Example 2: Custom master address and port
>>> options = DistributedOption(
...     distributed=True, dist_master_addr="192.168.1.1", dist_master_port=12345
... )
>>> options.init_options()
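As a sketch of the tcp:// rewriting described above (all attribute values are illustrative), supplying both the master address and port causes init_options() to rebuild dist_init_method as a TCP URL:
>>> from espnet2.train.distributed_utils import DistributedOption
>>> options = DistributedOption(
...     distributed=True,
...     dist_master_addr="192.168.1.1",
...     dist_master_port=12345,
...     dist_world_size=2,
...     dist_rank=0,
...     local_rank=0,
...     ngpu=1,
... )
>>> options.init_options()
>>> options.dist_init_method  # expected, per the description above: "tcp://192.168.1.1:12345"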
init_torch_distributed()
Initializes the PyTorch distributed environment.
This method sets up the distributed training environment using PyTorch’s torch.distributed module. It checks if distributed training is enabled and initializes the process group based on the specified backend, initialization method, world size, and rank.
It also configures the CUDA device if multiple GPUs are being used and the local rank is specified.
NOTE: This method should be called after the distributed options have been set up correctly, typically after calling init_options().
- Raises:
- ValueError – If the distributed environment is not properly initialized or if the rank is greater than or equal to the world size.
Examples
>>> dist_option = DistributedOption(distributed=True)
>>> dist_option.init_options()
>>> dist_option.init_torch_distributed()
SEE ALSO
PyTorch documentation on distributed training: https://pytorch.org/docs/stable/distributed.html
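For orientation only, a sketch (assumed values, not ESPnet code) of the plain torch.distributed calls that roughly correspond to what init_torch_distributed() performs for a two-process job running as rank 0 with local rank 0:
>>> import torch
>>> import torch.distributed as dist
>>> dist.init_process_group(
...     backend="nccl",        # dist_backend
...     init_method="env://",  # dist_init_method
...     world_size=2,          # dist_world_size
...     rank=0,                # dist_rank
... )
>>> if torch.cuda.is_available():
...     torch.cuda.set_device(0)  # pin the CUDA device to local_rank when ngpu >= 1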