espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.CTAReduce
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.CTAReduce(tid: int, x, storage, count: int, R_opid: int)
CTAReduce performs a CUDA Warp reduction on a given input tensor.
This function implements a device kernel for reducing input values using a specified reduction operation. Data is recursively read from the right half of the active segment and reduced onto the left half, halving the reduction space each iteration. Once the offset falls below the warp size, the remaining reduction is carried out at warp level with `shfl_down_sync`, combining results within the warp.
NOTE
Efficient warp reduction requires input shapes that are powers of two (2^K).
References
- Warp Primitives
[https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/]
- Parameters:
- tid – int CUDA thread index.
- x – float Activation value to be reduced.
- storage – array Shared memory of size CTA_REDUCE_SIZE used for parallel reduction.
- count – int Equivalent to num_rows, which corresponds to alphabet_size (V + 1).
- R_opid – int Operator ID for reduction. See R_Op for more information.
- Returns: float : The reduced value after applying the specified reduction operation.
Examples
>>> # Inside a @cuda.jit device function (device-side sketch):
>>> storage = cuda.shared.array(shape=(CTA_REDUCE_SIZE,), dtype=float32)
>>> reduced_value = CTAReduce(tid=0, x=5.0, storage=storage, count=16, R_opid=0)
- Raises: None
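The halving pattern described above can be modeled on the host in plain Python. The sketch below is illustrative only: `cta_reduce_model` is a hypothetical helper, not part of the ESPnet API, and the assumption that `R_opid` 0 selects addition and 1 selects maximum is for illustration (the actual mapping is defined by `R_Op`). The real `CTAReduce` runs one thread per `tid` on the GPU, with shared memory and warp shuffles replacing the inner loop.

```python
def cta_reduce_model(values, r_opid=0):
    """Host-side model of the tree reduction performed by CTAReduce.

    Repeatedly folds the right half of the active segment onto the
    left half, halving the reduction space each iteration, until a
    single value remains at index 0.
    """
    storage = list(values)
    n = len(storage)
    # Efficient reduction assumes a power-of-two input size (see NOTE above).
    assert n > 0 and n & (n - 1) == 0, "input size must be a power of two"
    offset = n // 2
    while offset > 0:
        # Each "thread" tid < offset reduces storage[tid + offset] into storage[tid].
        for tid in range(offset):
            a, b = storage[tid], storage[tid + offset]
            # Assumed mapping for illustration: 0 -> add, 1 -> max.
            storage[tid] = a + b if r_opid == 0 else max(a, b)
        offset //= 2
    return storage[0]
```

On the GPU, the loop over `tid` is replaced by parallel threads synchronizing through shared memory, and the final sub-warp iterations use `shfl_down_sync` to exchange values directly between registers.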