espnet2.train.distributed_utils.get_num_nodes
Get the number of nodes.
Used for “multiprocessing distributed” mode. In this case, RANK equals the node ID, and the real rank is computed as (nGPU * NodeID) + LOCAL_RANK in torch.distributed.
This function determines the number of nodes participating in the distributed training setup. It checks the launcher type (e.g., slurm, mpi) to retrieve the appropriate environment variables or uses the provided parameter if available.
- Parameters:
- prior (Optional[int]) – An optional prior value for the number of nodes. If provided, this value is returned directly without checking the environment.
- launcher (Optional[str]) – The launcher type used to start the process. This can be “slurm”, “mpi”, or None. If None, it defaults to checking the WORLD_SIZE environment variable.
- Returns: The number of nodes participating in the distributed training. Returns 1 if no nodes are found and no prior is provided.
- Return type: Optional[int]
- Raises:
- RuntimeError – If the launcher is “slurm” and the environment is not set up correctly, or if the launcher is not supported.
Examples
>>> get_num_nodes()
1
>>> get_num_nodes(prior=3)
3
>>> get_num_nodes(launcher="slurm")
5 # Assuming SLURM_STEP_NUM_NODES is set to 5 in the environment.
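The resolution order described above (prior value first, then launcher-specific environment variables, then WORLD_SIZE) can be sketched as follows. This is an illustrative re-implementation under stated assumptions, not the actual ESPnet source; the function name `get_num_nodes_sketch` and the exact error messages are hypothetical, and real MPI handling in ESPnet may consult MPI-specific environment variables rather than raising.

```python
import os
from typing import Optional


def get_num_nodes_sketch(
    prior: Optional[int] = None,
    launcher: Optional[str] = None,
) -> int:
    """Illustrative sketch of the lookup logic (not the real ESPnet code)."""
    if prior is not None:
        # A prior value short-circuits any environment inspection.
        return prior
    if launcher == "slurm":
        # Hypothetical check: the real code may validate more variables.
        if "SLURM_STEP_NUM_NODES" not in os.environ:
            raise RuntimeError(
                "launcher=slurm requires SLURM_STEP_NUM_NODES to be set"
            )
        return int(os.environ["SLURM_STEP_NUM_NODES"])
    if launcher is not None and launcher != "mpi":
        raise RuntimeError(f"launcher={launcher} is not supported")
    # Fallback: in "multiprocessing distributed" mode, WORLD_SIZE equals
    # the number of nodes (RANK is the node ID), defaulting to 1.
    return int(os.environ.get("WORLD_SIZE", 1))
```

With no environment configured, the fallback branch returns 1, matching the first example above; setting SLURM_STEP_NUM_NODES=5 and calling with launcher="slurm" reproduces the last example.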