espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt.MultiblankGPURNNT
class espnet2.asr.transducer.rnnt_multi_blank.utils.cuda_utils.gpu_rnnt.MultiblankGPURNNT(sigma: float, num_big_blanks: int, minibatch: int, maxT: int, maxU: int, alphabet_size: int, workspace, big_blank_workspace, blank: int, fastemit_lambda: float, clamp: float, num_threads: int, stream)
Bases: GPURNNT
Helper class to launch the CUDA kernels that compute the multi-blank transducer loss.
This class extends the GPURNNT class to accommodate multi-blank RNNTs. It utilizes CUDA kernels to efficiently compute both the loss and gradients required for training multi-blank transducers as described in the paper (https://arxiv.org/pdf/2211.03541).
sigma
Hyper-parameter related to the logit-normalization method in training multi-blank transducers.
- Type: float
num_big_blanks
Number of big blank symbols the model has, excluding the standard blank symbol.
- Type: int
big_blank_workspace
Allocated memory for multi-blank related computations.
- Type: torch.Tensor
Parameters:
- sigma – Hyper-parameter related to the logit-normalization method used in training multi-blank transducers.
- num_big_blanks – Number of big blank symbols the model has, excluding the standard blank symbol.
- minibatch – Int representing the batch size.
- maxT – The maximum possible acoustic sequence length (T in logprobs).
- maxU – The maximum possible target sequence length (U in logprobs).
- alphabet_size – The vocabulary dimension (V + 1 + num_big_blanks).
- workspace – An allocated chunk of memory that will be sliced and reshaped into the blocks used as working memory.
- big_blank_workspace – An allocated chunk of memory that will be sliced and reshaped into the blocks used as working memory for the multi-blank computations.
- blank – Index of the RNNT blank token in the vocabulary. Generally the first or last token in the vocab.
- fastemit_lambda – Float scaling factor for FastEmit regularization. Refer to FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization.
- clamp – Float value; when set to a value >= 0.0, the gradient is clamped to [-clamp, clamp].
- num_threads – Number of OMP threads to launch.
- stream – Numba CUDA stream.
########### Examples
```python
multiblank_rnnt = MultiblankGPURNNT(
    sigma=0.5, num_big_blanks=3, minibatch=32, maxT=100, maxU=50,
    alphabet_size=30, workspace=torch.zeros(1024).cuda(),
    big_blank_workspace=torch.zeros(512).cuda(), blank=0,
    fastemit_lambda=0.1, clamp=5.0, num_threads=4, stream=cuda.stream(),
)
```
######## NOTE The compute_cost_and_score method computes both the loss and the gradients. Ensure that all input tensors are properly initialized before calling the class methods.
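For orientation, here is a minimal end-to-end sketch that builds the helper and computes the loss and gradients via cost_and_grad. The workspace sizes, tensor shapes, dtypes, blank index, and CUDA stream below are illustrative assumptions, not values prescribed by this class:

```python
import torch
from numba import cuda

B, T, U, V, num_big_blanks = 4, 50, 10, 28, 3

# Activations over the extended vocabulary: V tokens, the standard
# blank, and num_big_blanks big blanks (shapes are assumptions).
acts = torch.randn(B, T, U, V + 1 + num_big_blanks, device="cuda")
grads = torch.zeros_like(acts)
costs = torch.zeros(B, device="cuda")
pad_labels = torch.randint(0, V, (B, U), dtype=torch.int64, device="cuda")
label_lengths = torch.full((B,), U - 1, dtype=torch.int64, device="cuda")
input_lengths = torch.full((B,), T, dtype=torch.int64, device="cuda")

rnnt = MultiblankGPURNNT(
    sigma=0.05, num_big_blanks=num_big_blanks, minibatch=B, maxT=T, maxU=U,
    alphabet_size=V + 1 + num_big_blanks,
    workspace=torch.zeros(1 << 20, device="cuda"),            # assumed size
    big_blank_workspace=torch.zeros(1 << 16, device="cuda"),  # assumed size
    blank=V,                      # assumed blank index (after regular tokens)
    fastemit_lambda=0.0, clamp=0.0, num_threads=4, stream=cuda.stream(),
)

# Per the docstring below, acts/grads are passed flattened.
status = rnnt.cost_and_grad(acts.view(-1), grads.view(-1), costs,
                            pad_labels, label_lengths, input_lengths)
```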
compute_cost_and_score(acts: Tensor, grads: Tensor | None, costs: Tensor, labels: Tensor, label_lengths: Tensor, input_lengths: Tensor) → RNNTStatus
Compute both the loss and the gradients.
This method calculates the negative log likelihood loss and, if gradients are required, computes the gradients of the activation matrix with respect to the costs vector. It utilizes CUDA kernels for efficient computation.
- Parameters:
- acts – A flattened tensor of shape [B, T, U, V+1] representing the activation matrix.
- grads – A flattened zero tensor of the same shape as acts, which will be updated with the computed gradients.
- costs – A zero vector of length B that will be updated in-place with the log probability costs.
- labels – A flattened matrix of labels of shape [B, U].
- label_lengths – A vector of length B that contains the original lengths of the target sequence.
- input_lengths – A vector of length B that contains the original lengths of the acoustic sequence.
Updates: This method launches CUDA kernels that update the following variables in-place:
- grads: Gradients of the activation matrix with respect to the costs vector.
- costs: Negative log likelihood of the forward variable.
- Returns: An enum representing the status of the RNNT operation, which can indicate success or failure.
########### Examples
>>> acts = torch.rand(B, T, U, V + 1)          # Random activation matrix
>>> grads = torch.zeros_like(acts)             # Initialize gradients
>>> costs = torch.zeros(B)                     # Initialize costs
>>> labels = torch.randint(0, V, (B, U))       # Random labels
>>> label_lengths = torch.randint(1, U, (B,))  # Random label lengths
>>> input_lengths = torch.randint(1, T, (B,))  # Random input lengths
>>> status = compute_cost_and_score(acts, grads, costs, labels,
...                                 label_lengths, input_lengths)
######## NOTE Ensure that the input tensors are properly flattened and have the correct shapes as expected by the function.
cost_and_grad(acts: Tensor, grads: Tensor, costs: Tensor, pad_labels: Tensor, label_lengths: Tensor, input_lengths: Tensor)
Computes the cost and gradients of the activation tensor.
This function checks for the validity of the input tensors and then computes the cost and gradients by calling the internal method compute_cost_and_score. The inputs include the activation tensor, gradients tensor, costs tensor, padded labels, label lengths, and input lengths.
- Parameters:
- acts (torch.Tensor) – A flattened tensor of shape [B, T, U, V+1] representing the activation matrix.
- grads (torch.Tensor) – A flattened tensor of the same shape as acts, which will be updated in place with gradients.
- costs (torch.Tensor) – A zero vector of length B that will be updated in place with the log probability costs.
- pad_labels (torch.Tensor) – A flattened matrix of labels of shape [B, U].
- label_lengths (torch.Tensor) – A vector of length B that contains the original lengths of the target sequences.
- input_lengths (torch.Tensor) – A vector of length B that contains the original lengths of the acoustic sequences.
- Returns: An enum indicating either a successful RNNT operation or failure due to invalid input.
- Return type: global_constants.RNNTStatus
- Raises: global_constants.RNNTStatus.RNNT_STATUS_INVALID_VALUE – If any of the input tensors are None.
########### Examples
>>> acts = torch.randn(4, 10, 5, 12) # Example activation tensor
>>> grads = torch.zeros_like(acts) # Initialize gradients
>>> costs = torch.zeros(4) # Initialize costs
>>> pad_labels = torch.randint(0, 10, (4, 5)) # Padded labels
>>> label_lengths = torch.tensor([5, 4, 5, 3]) # Lengths of labels
>>> input_lengths = torch.tensor([10, 10, 10, 10]) # Input lengths
>>> result = model.cost_and_grad(acts, grads, costs, pad_labels,
...                              label_lengths, input_lengths)
######## NOTE Ensure that all tensors are on the same device (CPU or GPU) to avoid runtime errors.
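To illustrate that device requirement, here is a small hedged sketch (reusing the tensor names from the example above and assuming a single CUDA device):

```python
device = torch.device("cuda:0")  # assumed target device

# Move every input onto the same device before the call; the in-place
# updates then land in the moved copies bound to these names.
acts, grads, costs = acts.to(device), grads.to(device), costs.to(device)
pad_labels = pad_labels.to(device)
label_lengths = label_lengths.to(device)
input_lengths = input_lengths.to(device)

result = model.cost_and_grad(acts, grads, costs, pad_labels,
                             label_lengths, input_lengths)
```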
score_forward(acts: Tensor, costs: Tensor, pad_labels: Tensor, label_lengths: Tensor, input_lengths: Tensor)
Compute the forward score for the RNNT model.
This function computes the negative log likelihood costs for the given activations without calculating gradients. It is useful during inference or evaluation where only the score is required.
- Parameters:
- acts – A tensor of shape [B, T, U, V+1] representing the activation matrix from the model, where B is the batch size, T is the maximum acoustic sequence length, U is the maximum target sequence length, and V is the vocabulary size.
- costs – A tensor of shape [B] that will be updated in-place with the log probability costs for each element in the batch.
- pad_labels – A tensor of shape [B, U] containing the padded target labels for each element in the batch.
- label_lengths – A tensor of shape [B] containing the actual lengths of the target labels for each element in the batch.
- input_lengths – A tensor of shape [B] containing the actual lengths of the input sequences for each element in the batch.
- Returns: An enum value representing the status of the RNNT operation, which can indicate success or failure.
- Raises: global_constants.RNNTStatus.RNNT_STATUS_INVALID_VALUE – If any of the input tensors are None.
########### Examples
>>> acts = torch.rand(2, 10, 5, 20) # Example activation tensor
>>> costs = torch.zeros(2) # Initialize costs tensor
>>> pad_labels = torch.tensor([[1, 2, 3], [1, 0, 0]])
>>> label_lengths = torch.tensor([3, 1])
>>> input_lengths = torch.tensor([10, 10])
>>> status = model.score_forward(acts, costs, pad_labels,
...                              label_lengths, input_lengths)
>>> print(costs) # Updated costs tensor after the call
######## NOTE Ensure that all input tensors are correctly shaped and contain valid data before calling this function to avoid errors.
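As a hedged evaluation-style sketch (reusing the tensors from the example above; aggregating the per-utterance costs into a mean is an assumption about downstream use, not part of this API):

```python
import torch

with torch.no_grad():  # scoring only; no gradients are computed
    status = model.score_forward(acts, costs, pad_labels,
                                 label_lengths, input_lengths)

# costs now holds the per-utterance negative log likelihoods.
mean_nll = costs.mean().item()
```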