espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.CTAReduce
espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.reduce.CTAReduce(tid: int, x, storage, count: int, R_opid: int)
CTAReduce performs a CUDA Warp reduction on a given input tensor.
This function implements a device kernel for reducing input values using a specified reduction operation. Data is recursively read from the right half of the active segment and reduced onto the left half, halving the reduction space each iteration. Once the offset falls below the warp size, the remaining reduction is carried out at warp level with `shfl_down_sync`, combining results within the warp.
NOTE
Efficient warp reduction requires input shapes that are powers of two (2^K).
References
- Warp Primitives
[https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/]
- Parameters:
- tid – int CUDA thread index.
- x – float Activation value to be reduced.
- storage – array Shared memory of size CTA_REDUCE_SIZE used for parallel reduction.
- count – int Equivalent to num_rows, which corresponds to alphabet_size (V + 1).
- R_opid – int Operator ID for reduction. See R_Op for more information.
- Returns: float : The reduced value after applying the specified reduction operation.
Examples
>>> # Inside a @cuda.jit device function (device-side sketch):
>>> storage = cuda.shared.array(shape=(CTA_REDUCE_SIZE,), dtype=float32)
>>> reduced_value = CTAReduce(tid=0, x=5.0, storage=storage, count=16, R_opid=0)
- Raises: None
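The halving pattern described above can be modeled on the host in plain Python. The sketch below is illustrative only: `cta_reduce_model` is a hypothetical helper, not part of the ESPnet API, and the assumption that `R_opid` 0 selects addition and 1 selects maximum is for illustration (the actual mapping is defined by `R_Op`). The real `CTAReduce` runs one thread per `tid` on the GPU, with shared memory and warp shuffles replacing the inner loop.

```python
def cta_reduce_model(values, r_opid=0):
    """Host-side model of the tree reduction performed by CTAReduce.

    Repeatedly folds the right half of the active segment onto the
    left half, halving the reduction space each iteration, until a
    single value remains at index 0.
    """
    storage = list(values)
    n = len(storage)
    # Efficient reduction assumes a power-of-two input size (see NOTE above).
    assert n > 0 and n & (n - 1) == 0, "input size must be a power of two"
    offset = n // 2
    while offset > 0:
        # Each "thread" tid < offset reduces storage[tid + offset] into storage[tid].
        for tid in range(offset):
            a, b = storage[tid], storage[tid + offset]
            # Assumed mapping for illustration: 0 -> add, 1 -> max.
            storage[tid] = a + b if r_opid == 0 else max(a, b)
        offset //= 2
    return storage[0]
```

On the GPU, the loop over `tid` is replaced by parallel threads synchronizing through shared memory, and the final sub-warp iterations use `shfl_down_sync` to exchange values directly between registers.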