espnet2.train.distributed_utils.DistributedOption
class espnet2.train.distributed_utils.DistributedOption(distributed: bool = False, dist_backend: str = 'nccl', dist_init_method: str = 'env://', dist_world_size: int | None = None, dist_rank: int | None = None, local_rank: int | None = None, ngpu: int = 0, dist_master_addr: str | None = None, dist_master_port: int | None = None, dist_launcher: str | None = None, multiprocessing_distributed: bool = True)
Bases: object
Dataclass to manage distributed training options in PyTorch.
distributed
Flag to enable distributed training. Default is False.
- Type: bool
dist_backend
The backend to use for distributed training. Options include “nccl”, “mpi”, “gloo”, or “tcp”. Default is “nccl”.
- Type: str
dist_init_method
Method for initializing the process group. If “env://”, it uses environment variables for configuration. Default is “env://”.
- Type: str
dist_world_size
Total number of processes participating in the job. Default is None.
- Type: Optional[int]
dist_rank
Rank of the current process. Default is None.
- Type: Optional[int]
local_rank
Rank of the current process on the node. Default is None.
- Type: Optional[int]
ngpu
Number of GPUs available for training. Default is 0.
- Type: int
dist_master_addr
Address of the master process. Default is None.
- Type: Optional[str]
dist_master_port
Port of the master process. Default is None.
- Type: Optional[int]
dist_launcher
The launcher used to start the distributed job. Default is None.
- Type: Optional[str]
multiprocessing_distributed
Flag indicating if the training is using multiprocessing. Default is True.
- Type: bool
init_options()
Initializes distributed training options based on the specified attributes.
init_torch_distributed()
Initializes the PyTorch distributed process group.
init_deepspeed()
Initializes DeepSpeed distributed training.
- Raises:
- RuntimeError – If required environment variables are not set or if inconsistencies in ranks or world size are detected.
- ValueError – If trying to initialize DeepSpeed without initializing PyTorch distributed first.
Examples
>>> options = DistributedOption(distributed=True, ngpu=2)
>>> options.init_options()
>>> options.init_torch_distributed()
>>> options.init_deepspeed()
NOTE: This class is designed to be used in distributed training scenarios, particularly with PyTorch and DeepSpeed.
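The following is a minimal, illustrative sketch (not part of the ESPnet API) of how a torchrun-style launcher environment feeds the "env://" initialization path; the values are placeholders for a two-process, single-node job.
>>> import os
>>> from espnet2.train.distributed_utils import DistributedOption
>>> os.environ["MASTER_ADDR"] = "127.0.0.1"  # address of the rank-0 process
>>> os.environ["MASTER_PORT"] = "29500"      # free TCP port on the rank-0 host
>>> os.environ["WORLD_SIZE"] = "2"           # total number of processes
>>> os.environ["RANK"] = "0"                 # global rank of this process
>>> os.environ["LOCAL_RANK"] = "0"           # rank within this node
>>> options = DistributedOption(distributed=True, dist_backend="nccl", ngpu=1)
>>> options.init_options()            # fills rank/world_size/local_rank from the env
>>> options.init_torch_distributed()  # joins the PyTorch process group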
init_deepspeed()
Initialize DeepSpeed for distributed training.
This method sets up DeepSpeed for distributed training by first ensuring that PyTorch’s distributed backend is initialized. It checks that the necessary environment variables are set and raises appropriate errors if they are not. The method also logs a warning if the OMP_NUM_THREADS environment variable is set to 1, suggesting that this may not be sufficient for optimal performance with DeepSpeed.
- Raises:
- ImportError – If the DeepSpeed package cannot be imported.
- ValueError – If PyTorch distributed is not initialized before initializing DeepSpeed.
Examples
>>> distributed_options = DistributedOption(distributed=True)
>>> distributed_options.init_options()
>>> distributed_options.init_torch_distributed()
>>> distributed_options.init_deepspeed()
NOTE: Ensure that the environment variables for distributed training, such as RANK, WORLD_SIZE, and LOCAL_RANK, are properly set before calling this method.
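A defensive calling order, sketched under the assumption that a launcher has already exported the variables named in the NOTE above; the explicit check is illustrative and not part of this class.
>>> import os
>>> from espnet2.train.distributed_utils import DistributedOption
>>> missing = [v for v in ("RANK", "WORLD_SIZE", "LOCAL_RANK") if v not in os.environ]
>>> if missing:
...     raise RuntimeError(f"export {missing} before initializing DeepSpeed")
>>> options = DistributedOption(distributed=True, ngpu=1)
>>> options.init_options()
>>> options.init_torch_distributed()  # must come first; otherwise ValueError is raised
>>> options.init_deepspeed()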
init_options()
Initialize the options for distributed training.
This method configures the distributed training settings based on the specified attributes. It verifies that the necessary environment variables are set and assigns values to the dist_rank, dist_world_size, and local_rank attributes. It also checks for potential issues such as exceeding the number of visible devices.
If the dist_init_method is set to “env://”, the method will attempt to retrieve the master address and port from the environment variables or use the specified values. If both the master address and port are provided, it will set the dist_init_method to a TCP URL format.
- Raises:
- RuntimeError – If required environment variables or attributes are not set correctly or if the rank exceeds the world size.
Examples
Example 1: Using default environment variables
>>> options = DistributedOption(distributed=True)
>>> options.init_options()
Example 2: Custom master address and port
>>> options = DistributedOption(
...     distributed=True, dist_master_addr="192.168.1.1", dist_master_port=12345
... )
>>> options.init_options()
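As a sketch of the tcp:// rewriting described above (all attribute values are illustrative), supplying both the master address and port causes init_options() to rebuild dist_init_method as a TCP URL:
>>> from espnet2.train.distributed_utils import DistributedOption
>>> options = DistributedOption(
...     distributed=True,
...     dist_master_addr="192.168.1.1",
...     dist_master_port=12345,
...     dist_world_size=2,
...     dist_rank=0,
...     local_rank=0,
...     ngpu=1,
... )
>>> options.init_options()
>>> options.dist_init_method  # expected, per the description above: "tcp://192.168.1.1:12345"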
init_torch_distributed()
Initializes the PyTorch distributed environment.
This method sets up the distributed training environment using PyTorch’s torch.distributed module. It checks if distributed training is enabled and initializes the process group based on the specified backend, initialization method, world size, and rank.
It also configures the CUDA device if multiple GPUs are being used and the local rank is specified.
NOTE: This method should be called after the distributed options have been set up correctly, typically after calling init_options().
- Raises:
- ValueError – If the distributed environment is not properly initialized or if the rank is greater than or equal to the world size.
Examples
>>> dist_option = DistributedOption(distributed=True)
>>> dist_option.init_options()
>>> dist_option.init_torch_distributed()
SEE ALSO
PyTorch documentation on distributed training: https://pytorch.org/docs/stable/distributed.html
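For orientation only, a sketch (assumed values, not ESPnet code) of the plain torch.distributed calls that roughly correspond to what init_torch_distributed() performs for a two-process job running as rank 0 with local rank 0:
>>> import torch
>>> import torch.distributed as dist
>>> dist.init_process_group(
...     backend="nccl",        # dist_backend
...     init_method="env://",  # dist_init_method
...     world_size=2,          # dist_world_size
...     rank=0,                # dist_rank
... )
>>> if torch.cuda.is_available():
...     torch.cuda.set_device(0)  # pin the CUDA device to local_rank when ngpu >= 1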