espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.reduce_max

espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.reduce_max(acts: Tensor, denom, rows: int, cols: int, minus: bool, stream)

Helper method to call the Warp Reduction Kernel to perform max reduction.

This function launches a warp reduction that computes maximum values over the flattened activation matrix acts: each of the cols blocks reduces over the rows values of the vocabulary dimension. The result is written to the denom output tensor.

Warp reduction is most efficient when the input shapes are powers of two (2^K).
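
To illustrate why power-of-two sizes suit this kind of reduction, the following pure-Python sketch emulates a tree-style max reduction of the sort that warp shuffles perform: at each step the stride halves, and each active lane keeps the larger of its own value and its partner's. The function name `tree_max` and the padding with negative infinity are illustrative assumptions, not part of the ESPnet kernel.

```python
import math


def tree_max(values):
    """Emulate a tree-style max reduction over a list of floats (hypothetical helper)."""
    if not values:
        raise ValueError("values must be non-empty")
    # Pad to the next power of two with -inf so every lane has a partner.
    size = 1 << (len(values) - 1).bit_length()
    lanes = list(values) + [-math.inf] * (size - len(values))
    steps = 0
    stride = size // 2
    while stride > 0:
        # Each of the first `stride` lanes combines with the lane `stride` away.
        for lane in range(stride):
            lanes[lane] = max(lanes[lane], lanes[lane + stride])
        stride //= 2
        steps += 1
    return lanes[0], steps


result, steps = tree_max([0.3, 0.9, 0.1, 0.7, 0.5, 0.2, 0.8, 0.4])
print(result, steps)  # 0.9 3
```

With 8 inputs the loop finishes in 3 steps and no lane holds padding. An input of 5 values would also take 3 steps, but three of the lanes would do no useful work.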

References

Warp Primitives [https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/]

Parameters:
- acts – A flattened activation matrix of shape [B * T * U * (V+1)]. This tensor contains the values to be reduced.
- denom – A flattened output matrix of shape [B * T * U * (V+1)]. This tensor will be overwritten with the results of the reduction operation.
- rows – An integer representing the vocabulary size including the blank token, i.e. V+1. This value indicates the number of threads per block.
- cols – An integer representing the flattened shape of the activation matrix, excluding the vocabulary dimension (B * T * U). This value indicates the number of blocks per grid.
- minus – A boolean flag indicating whether to add or subtract during the reduction. If minus is set to True, it calls the _reduce_minus kernel; otherwise, it calls the _reduce_rows kernel.
- stream – The CUDA Stream used for executing the kernel.
Returns: A boolean value indicating the success of the reduction operation.
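
As a rough picture of how rows and cols partition the flattened buffer, the sketch below computes the same kind of per-position maximum on the CPU in plain Python. It assumes the vocabulary dimension is innermost, so each consecutive run of `rows` values belongs to one of the `cols` positions, and it returns one maximum per position. The helper name `reference_reduce_max` is hypothetical, and how the real kernel lays out its results in denom is not shown here.

```python
def reference_reduce_max(acts, rows, cols):
    """CPU reference for a per-position max over a flattened buffer (hypothetical helper).

    Assumes acts holds cols consecutive groups of rows values each.
    """
    if len(acts) != rows * cols:
        raise ValueError("acts must contain rows * cols values")
    maxima = []
    for col in range(cols):  # one iteration per block in the grid
        start = col * rows
        group = acts[start:start + rows]  # the values one block would reduce
        maxima.append(max(group))
    return maxima


# Two positions (cols = 2), vocabulary of three including blank (rows = 3).
flat_acts = [0.1, 0.7, 0.3, 0.9, 0.2, 0.4]
print(reference_reduce_max(flat_acts, rows=3, cols=2))  # [0.7, 0.9]
```

A reference like this can be handy for checking GPU output on small inputs before moving to realistic sizes.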

Examples

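>>> import torch
>>> from numba import cuda
>>> from espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce import reduce_max
>>> B, T, U, V = 4, 10, 20, 30  # Illustrative sizes
>>> rows = V + 1  # Vocabulary size including the blank token
>>> cols = B * T * U  # Number of blocks per grid
>>> acts = torch.rand(cols * rows, device="cuda")  # Flattened activation tensor
>>> denom = torch.zeros(cols * rows, device="cuda")  # Flattened output tensor, overwritten
>>> stream = cuda.stream()  # Example CUDA stream
>>> reduce_max(acts, denom, rows, cols, False, stream)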