espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt.GPURNNT
class espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt.GPURNNT(minibatch: int, maxT: int, maxU: int, alphabet_size: int, workspace, blank: int, fastemit_lambda: float, clamp: float, num_threads: int, stream)
Bases: object
Helper class to launch the CUDA Kernels to compute the Transducer Loss.
This class is responsible for computing the RNNT (Recurrent Neural Network Transducer) loss and its gradients using CUDA kernels. It manages workspace memory and handles input activations to efficiently compute both the loss and the gradients.
Examples:

```python
import torch
from numba import cuda

minibatch, maxT, maxU, alphabet_size = 32, 100, 50, 29
workspace = torch.zeros((minibatch, maxT * maxU * (alphabet_size + 1)))
gpu_rnnt = GPURNNT(minibatch=32, maxT=100, maxU=50,
                   alphabet_size=29, workspace=workspace,
                   blank=0, fastemit_lambda=0.5, clamp=0.1,
                   num_threads=4, stream=cuda.stream())
```
- Parameters:
- minibatch – Int representing the batch size.
- maxT – The maximum possible acoustic sequence length. Represents T in the logprobs tensor.
- maxU – The maximum possible target sequence length. Represents U in the logprobs tensor.
- alphabet_size – The vocabulary dimension V+1 (inclusive of RNNT blank).
- workspace – An allocated chunk of memory that will be sliced off and reshaped into required blocks used as working memory.
- blank – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
- fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
- clamp – Float value. When set to value >= 0.0, will clamp the gradient to [-clamp, clamp].
- num_threads – Number of OMP threads to launch.
- stream – Numba Cuda Stream.
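Once constructed, the instance exposes a training path and an evaluation path. The sketch below shows the calling convention (tensor shapes mirror the constructor example above; the dtype and device choices here are assumptions, not part of the documented API):

```python
import torch

acts = torch.randn(32, 100, 50, 29, device="cuda")                       # [B, T, U, V+1] activations
grads = torch.zeros_like(acts)                                           # gradient buffer
costs = torch.zeros(32, device="cuda")                                   # per-utterance NLL
pad_labels = torch.randint(1, 29, (32, 50), dtype=torch.int64, device="cuda")  # padded targets [B, U]
label_lengths = torch.full((32,), 49, dtype=torch.int32, device="cuda")
input_lengths = torch.full((32,), 100, dtype=torch.int32, device="cuda")

# Training path: fills both `costs` and `grads` in place.
status = gpu_rnnt.cost_and_grad(acts, grads, costs, pad_labels, label_lengths, input_lengths)

# Evaluation path: fills only `costs`; no gradient buffer is needed.
status = gpu_rnnt.score_forward(acts, costs, pad_labels, label_lengths, input_lengths)
```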
compute_cost_and_score(acts: Tensor, grads: Tensor | None, costs: Tensor, labels: Tensor, label_lengths: Tensor, input_lengths: Tensor) → RNNTStatus
Compute both the loss and the gradients.
This method computes the negative log likelihood costs and the gradients for the RNNT model during training or evaluation. It performs the forward pass through the RNNT computation graph, calculating both alphas and betas, which are used to derive the gradients and costs.
- Parameters:
- acts – A flattened tensor of shape [B, T, U, V+1] representing the activation matrix, where B is the batch size, T is the maximum acoustic sequence length, U is the maximum target sequence length, and V is the vocabulary size.
- grads – An optional flattened tensor of the same shape as acts, initialized to zero, to store gradients. If not provided, gradients will not be computed.
- costs – A zero vector of length B that will be updated in place with the log probability costs.
- labels – A flattened matrix of labels of shape [B, U], representing the target sequences for each batch.
- label_lengths – A vector of length B that contains the original lengths of the target sequences.
- input_lengths – A vector of length B that contains the original lengths of the acoustic sequences.
Updates: This method launches kernels that update the following variables in place:
- grads: Gradients of the activation matrix with respect to the costs vector.
- costs: Negative log likelihood of the forward variable.
- Returns: An enum that represents either a successful RNNT operation or failure.
########### Examples
>>> acts = torch.rand((2, 10, 5, 20)) # Random activation matrix
>>> grads = torch.zeros((2, 10, 5, 20)) # Initialize gradients
>>> costs = torch.zeros(2) # Initialize costs
>>> labels = torch.tensor([[1, 2, 3, 0, 0], [2, 3, 0, 0, 0]]) # Example labels
>>> label_lengths = torch.tensor([3, 2]) # Lengths of each label
>>> input_lengths = torch.tensor([10, 10]) # Lengths of each input
>>> status = gpu_rnnt.compute_cost_and_score(acts, grads, costs, labels,
...                                           label_lengths, input_lengths)
>>> print(status) # Check the status of the operation
####### NOTE Ensure that the input tensors are properly shaped and initialized before calling this method.
- Raises: ValueError – If the shapes of the input tensors do not match the expected dimensions.
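The alphas and betas referenced in the description above follow the standard RNNT lattice recursion. As a point of reference, here is a minimal single-utterance sketch of the forward variable in plain PyTorch (an illustration, not the CUDA kernel; `rnnt_alpha_loss` is a hypothetical helper name):

```python
import torch

def rnnt_alpha_loss(log_probs, labels, blank=0):
    """Forward-variable (alpha) recursion for one utterance.

    log_probs: [T, U+1, V+1] log-softmaxed activations.
    labels:    sequence of U target ids.
    Returns the negative log likelihood (the RNNT loss).
    """
    T, U1, _ = log_probs.shape
    alpha = torch.full((T, U1), float("-inf"))
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U1):
            if t == 0 and u == 0:
                continue
            from_blank = from_label = torch.tensor(float("-inf"))
            if t > 0:   # reach (t, u) by emitting blank at (t-1, u)
                from_blank = alpha[t - 1, u] + log_probs[t - 1, u, blank]
            if u > 0:   # reach (t, u) by emitting label u-1 at (t, u-1)
                from_label = alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]]
            alpha[t, u] = torch.logaddexp(from_blank, from_label)
    # Terminate with a final blank emission from the last lattice node.
    return -(alpha[T - 1, U1 - 1] + log_probs[T - 1, U1 - 1, blank])

# Toy utterance: T=4 frames, U=2 labels, V+1=5 tokens.
log_probs = torch.randn(4, 3, 5).log_softmax(dim=-1)
nll = rnnt_alpha_loss(log_probs, labels=[1, 2], blank=0)
```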
cost_and_grad(acts: Tensor, grads: Tensor, costs: Tensor, pad_labels: Tensor, label_lengths: Tensor, input_lengths: Tensor)
Computes the cost and gradients for the given activation tensor.
This function evaluates the negative log likelihood and computes the gradients of the RNNT model with respect to the input activations. It is the training-time entry point; when only the forward score is needed and gradients are not required, use score_forward instead.
- Parameters:
- acts (torch.Tensor) – A flattened tensor of shape [B, T, U, V+1] representing the activation matrix.
- grads (torch.Tensor) – A flattened zero tensor of the same shape as acts, which will be updated in place to hold the gradients.
- costs (torch.Tensor) – A zero vector of length B that will be updated in place with the log probability costs.
- pad_labels (torch.Tensor) – A flattened matrix of labels of shape [B, U].
- label_lengths (torch.Tensor) – A vector of length B containing the original lengths of the target sequences.
- input_lengths (torch.Tensor) – A vector of length B containing the original lengths of the acoustic sequences.
- Returns: An enum that indicates the status of the RNNT operation, which can either represent success or failure.
- Return type: global_constants.RNNTStatus
- Raises: global_constants.RNNTStatus.RNNT_STATUS_INVALID_VALUE – If any of the input tensors are None.
########### Examples
>>> acts = torch.randn(2, 10, 5, 20) # Random activation tensor
>>> grads = torch.zeros_like(acts) # Zero gradients tensor
>>> costs = torch.zeros(2) # Zero costs vector
>>> pad_labels = torch.randint(0, 20, (2, 5)) # Random labels
>>> label_lengths = torch.tensor([5, 4]) # Example lengths
>>> input_lengths = torch.tensor([10, 10]) # Example lengths
>>> status = gpurnnt.cost_and_grad(acts, grads, costs, pad_labels,
... label_lengths, input_lengths)
>>> print(status) # Should print the status of the operation
####### NOTE This method is part of the GPURNNT class, which provides various functionalities for working with RNNT models.
log_softmax(acts: Tensor, denom: Tensor)
Computes the log softmax denominator of the input activation tensor and stores the result in the provided denom tensor.
This method calculates the log softmax of the input activation tensor acts, which is expected to be a tensor of shape [B, T, U, V+1]. The results are stored in the denom tensor, which should be initialized as a zero tensor of the same shape as acts.
- Parameters:
- acts – Activation tensor of shape [B, T, U, V+1]. The input must be represented as a flat tensor of shape [B * T * U * (V+1)] to allow pointer indexing.
- denom – A zero tensor of the same shape as acts that will be updated in place with the computed log softmax values.
Updates: This method performs in-place updates to the denom tensor.
########### Examples
>>> acts = torch.randn(2, 3, 4, 5) # Random activation tensor
>>> denom = torch.zeros_like(acts) # Initialize denom tensor
>>> gpu_rnnt.log_softmax(acts, denom) # Compute log softmax denominator in place
####### NOTE This method uses CUDA kernels to perform the computations efficiently on the GPU.
- Raises: ValueError – If acts or denom are not of the correct shapes or types.
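Conceptually, the denominator stored by this method is one `-logsumexp` value per lattice node, so that adding it to any activation recovers a per-token log-probability. The PyTorch sketch below illustrates this relationship (an illustration only, independent of the flattened memory layout the kernels actually use):

```python
import torch

acts = torch.randn(2, 3, 4, 5)               # [B, T, U, V+1] activations
denom = -torch.logsumexp(acts, dim=-1)       # one denominator per (b, t, u) node
log_prob_blank = acts[..., 0] + denom        # log-probability of the blank token at each node
assert torch.allclose(log_prob_blank, acts.log_softmax(dim=-1)[..., 0])
```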
score_forward(acts: Tensor, costs: Tensor, pad_labels: Tensor, label_lengths: Tensor, input_lengths: Tensor)
Computes the forward score and updates the costs tensor.
This method calculates the loss based on the given activation tensor and updates the costs tensor with the negative log likelihood. It does not compute gradients since the grads parameter is set to None.
- Parameters:
- acts – A flattened tensor of shape [B, T, U, V+1] representing the activation matrix.
- costs – A zero vector of length B which will be updated in-place with the log probability costs.
- pad_labels – A flattened matrix of labels of shape [B, U].
- label_lengths – A vector of length B that contains the original lengths of the target sequences.
- input_lengths – A vector of length B that contains the original lengths of the acoustic sequences.
- Returns: An enum that either represents a successful RNNT operation or failure.
- Raises: ValueError – If any of the input tensors are None.
########### Examples
>>> acts = torch.rand(2, 5, 6, 10) # Example activation tensor
>>> costs = torch.zeros(2) # Initialize costs tensor
>>> pad_labels = torch.tensor([[1, 2, 3], [1, 2, 3]])
>>> label_lengths = torch.tensor([3, 3])
>>> input_lengths = torch.tensor([5, 5])
>>> result = model.score_forward(acts, costs, pad_labels,
... label_lengths, input_lengths)
>>> print(costs) # Updated costs after calling score_forward
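After the call, costs holds one negative log likelihood per utterance, so a scalar metric for validation can be reduced from it (a usage sketch; the mean reduction is a choice, not prescribed by the API):
>>> val_nll = costs.mean().item()  # average per-utterance NLL across the batch
>>> print(f"validation RNNT NLL: {val_nll:.4f}")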