espnet2.asr_transducer.decoder.rnn_decoder.RNNDecoder
class espnet2.asr_transducer.decoder.rnn_decoder.RNNDecoder(vocab_size: int, embed_size: int = 256, hidden_size: int = 256, rnn_type: str = 'lstm', num_layers: int = 1, dropout_rate: float = 0.0, embed_dropout_rate: float = 0.0, embed_pad: int = 0)
Bases: AbsDecoder
RNN decoder definition for Transducer models.
This class implements an RNN decoder module used in Transducer models. It supports both LSTM and GRU architectures and allows for customization of various parameters such as embedding size, hidden size, and dropout rates.
embed
Embedding layer for the input labels.
- Type: torch.nn.Embedding
dropout_embed
Dropout layer for the embedding.
- Type: torch.nn.Dropout
rnn
List of RNN layers (LSTM/GRU).
- Type: torch.nn.ModuleList
dropout_rnn
List of dropout layers for RNN outputs.
- Type: torch.nn.ModuleList
dlayers
Number of decoder layers.
- Type: int
dtype
Type of RNN used (‘lstm’ or ‘gru’).
- Type: str
output_size
Size of the output from the decoder.
- Type: int
vocab_size
Size of the vocabulary.
- Type: int
device
Device to run the model on (CPU/GPU).
- Type: torch.device
score_cache
Cache for storing scores of previous hypotheses.
- Type: dict
Parameters:
- vocab_size (int) – Vocabulary size.
- embed_size (int , optional) – Embedding size. Default is 256.
- hidden_size (int , optional) – Hidden size. Default is 256.
- rnn_type (str , optional) – Decoder layers type (‘lstm’ or ‘gru’). Default is ‘lstm’.
- num_layers (int , optional) – Number of decoder layers. Default is 1.
- dropout_rate (float , optional) – Dropout rate for decoder layers. Default is 0.0.
- embed_dropout_rate (float , optional) – Dropout rate for embedding layer. Default is 0.0.
- embed_pad (int , optional) – Embedding padding symbol ID. Default is 0.
##################### Examples
Create an RNNDecoder instance
decoder = RNNDecoder(vocab_size=1000, embed_size=256, hidden_size=512)
Forward pass with a batch of label sequences
labels = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.long)
output = decoder(labels)
Initialize decoder states for a single hypothesis
states = decoder.init_state(batch_size=1)
One-step forward hypothesis scoring
out, new_states = decoder.score(label_sequence=[1, 2], states=states)
######## NOTE The decoder supports only ‘lstm’ and ‘gru’ as valid RNN types. Attempting to use any other type will raise a ValueError.
Construct an RNNDecoder object.
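A minimal construction sketch (assuming espnet2 and torch are importable); the GRU variant and the ValueError for an unsupported rnn_type follow the note above.
from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder

# LSTM decoder (default rnn_type), two layers.
lstm_decoder = RNNDecoder(vocab_size=1000, embed_size=256, hidden_size=512, num_layers=2)

# GRU decoder: no cell state, so its hidden states are ((N, B, D_dec), None).
gru_decoder = RNNDecoder(vocab_size=1000, rnn_type="gru")

# Any other rnn_type is rejected with a ValueError.
try:
    RNNDecoder(vocab_size=1000, rnn_type="rnn")
except ValueError as err:
    print(err)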
batch_score(hyps: List[Hypothesis]) → Tuple[Tensor, Tuple[Tensor, Tensor | None]]
One-step forward hypotheses.
This method takes a list of hypotheses and computes the decoder output for each of them in a single step. It uses the last label of each hypothesis to generate the embeddings and feeds them through the RNN together with the batched hypothesis states.
- Parameters:hyps – A list of Hypothesis objects, each containing a sequence of label IDs and the corresponding decoder states.
- Returns:
- out: Decoder output sequences of shape (B, D_dec), where B is the batch size and D_dec is the decoder output dimension.
- states: Decoder hidden states, a tuple whose first element (hidden states) has shape (N, B, D_dec) and whose second element (cell states, present only when using LSTM) also has shape (N, B, D_dec).
##################### Examples
>>> from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder
>>> from espnet2.asr_transducer.beam_search_transducer import Hypothesis
>>> decoder = RNNDecoder(vocab_size=100, embed_size=64, hidden_size=128)
>>> state = decoder.select_state(decoder.init_state(batch_size=1), idx=0)
>>> hyps = [Hypothesis(score=0.0, yseq=[1, 2, 3], dec_state=state),
...         Hypothesis(score=0.0, yseq=[4, 5, 6], dec_state=state)]
>>> out, states = decoder.batch_score(hyps)
>>> print(out.shape)  # torch.Size([2, 128])
create_batch_states(new_states: List[Tuple[Tensor, Tensor | None]]) → Tuple[Tensor, Tensor | None]
Create decoder hidden states.
- Parameters:new_states – List of per-hypothesis decoder hidden states, where each element is a tuple structured as:
- For LSTM: ((N, 1, D_dec), (N, 1, D_dec))
- For GRU: ((N, 1, D_dec), None)
- Returns: Combined decoder hidden states. The output is a tuple structured as:
- For LSTM: ((N, B, D_dec), (N, B, D_dec))
- For GRU: ((N, B, D_dec), None)
- Return type: states
##################### Examples
>>> decoder = RNNDecoder(vocab_size=100, hidden_size=256, num_layers=2)
>>> new_states = [(torch.zeros(2, 1, 256), torch.zeros(2, 1, 256)),
...               (torch.zeros(2, 1, 256), torch.zeros(2, 1, 256))]
>>> batch_states = decoder.create_batch_states(new_states)
>>> print(batch_states[0].shape)  # torch.Size([2, 2, 256])
>>> print(batch_states[1].shape)  # torch.Size([2, 2, 256])
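create_batch_states is the counterpart of select_state: batched states produced by init_state (or returned by batch_score) can be split per hypothesis and reassembled. A short sketch using only the methods documented on this page:
from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder

decoder = RNNDecoder(vocab_size=100, hidden_size=256, num_layers=2)

# Batched LSTM states for 3 hypotheses: ((N, B, D_dec), (N, B, D_dec)).
batched = decoder.init_state(batch_size=3)

# Split into per-hypothesis states ((N, 1, D_dec), (N, 1, D_dec)) ...
per_hyp = [decoder.select_state(batched, idx=i) for i in range(3)]

# ... and reassemble them into a batch again.
rebuilt = decoder.create_batch_states(per_hyp)
print(rebuilt[0].shape)  # torch.Size([2, 3, 256])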
forward(labels: Tensor) → Tensor
Encode source label sequences.

This method embeds the input label sequences, applies embedding dropout, and passes the result through the stacked LSTM/GRU layers to produce the decoder output sequences.

- Parameters:labels (torch.Tensor) – Label ID sequences of shape (B, L).
##################### Examples
Initialize the RNNDecoder
decoder = RNNDecoder(vocab_size=1000, embed_size=256, hidden_size=256)
Forward pass with labels
labels = torch.randint(0, 1000, (32, 10))  # (B, L)
output = decoder.forward(labels)  # (B, U, D_dec)
- Returns: Decoder output sequences of shape (B, U, D_dec).
- Return type: out (torch.Tensor)
- Raises:ValueError – If the specified rnn_type is not supported (not ‘lstm’ or ‘gru’).
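For context, the forward output is the prediction-network side of a Transducer: a joint network combines it with the encoder output over every (T, U) pair. The sketch below is only an illustration using plain torch projections and random encoder features, not ESPnet's actual joint network module.
import torch
from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder

B, T, U, D_enc, vocab = 4, 50, 10, 256, 1000
decoder = RNNDecoder(vocab_size=vocab, embed_size=256, hidden_size=256)

labels = torch.randint(1, vocab, (B, U))   # (B, L) label sequences
dec_out = decoder(labels)                  # (B, U, D_dec)
enc_out = torch.randn(B, T, D_enc)         # placeholder encoder output

# Illustrative joint: project both sides to a common size, broadcast-add
# over the (T, U) lattice, then map to vocabulary logits.
joint_dim = 256
lin_enc = torch.nn.Linear(D_enc, joint_dim)
lin_dec = torch.nn.Linear(decoder.output_size, joint_dim)
lin_out = torch.nn.Linear(joint_dim, vocab)

lattice = torch.tanh(lin_enc(enc_out).unsqueeze(2) + lin_dec(dec_out).unsqueeze(1))
logits = lin_out(lattice)                  # (B, T, U, vocab)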
init_state(batch_size: int) → Tuple[Tensor, Tensor | None]
Initialize decoder states.

This method creates zero-valued initial hidden states (and cell states when the decoder uses LSTM layers) for the given batch size, allocated on the decoder's current device.

- Parameters:batch_size (int) – Batch size.
- Returns: Initial decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)
##################### Examples
>>> decoder = RNNDecoder(vocab_size=5000, embed_size=256, hidden_size=256)
>>> states = decoder.init_state(batch_size=32)
>>> states[0].shape
torch.Size([1, 32, 256])  # (N, B, D_dec) with num_layers=1
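A quick check of the state structure for both supported RNN types; shapes assume the default num_layers=1 and hidden_size=256.
from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder

lstm_dec = RNNDecoder(vocab_size=100, rnn_type="lstm")
h, c = lstm_dec.init_state(batch_size=4)
print(h.shape, c.shape)  # torch.Size([1, 4, 256]) torch.Size([1, 4, 256])

gru_dec = RNNDecoder(vocab_size=100, rnn_type="gru")
h, c = gru_dec.init_state(batch_size=4)
print(h.shape, c)        # torch.Size([1, 4, 256]) None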
rnn_forward(x: Tensor, state: Tuple[Tensor, Tensor | None]) → Tuple[Tensor, Tuple[Tensor, Tensor | None]]
Encode embedded label sequences through the decoder RNN layers.

This method feeds already-embedded inputs through each LSTM/GRU layer in turn, applying dropout to the output of every layer, and returns the final layer outputs together with the updated hidden (and, for LSTM, cell) states. It is used internally by forward, score, and batch_score.

- Parameters:
- x – Embedded input sequences.
- state – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)
- Returns:
- out: RNN output sequences.
- states: Updated decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)
######## NOTE The decoder supports both LSTM and GRU architectures; the choice should be made based on the requirements of the task.
score(label_sequence: List[int], states: Tuple[Tensor, Tensor | None]) → Tuple[Tensor, Tuple[Tensor, Tensor | None]]
One-step forward hypothesis.

This method embeds the last label of the current label sequence, performs a single decoder step from the provided hidden states, and returns the decoder output together with the updated states. Results are stored in score_cache, keyed by the label sequence, so that an already-scored prefix is not recomputed.

- Parameters:
- label_sequence (List[int]) – Current label ID sequence.
- states – Decoder hidden states. ((N, 1, D_dec), (N, 1, D_dec) or None)
- Returns:
- out: Decoder output sequence. (1, D_dec)
- states: Decoder hidden states. ((N, 1, D_dec), (N, 1, D_dec) or None)
##################### Examples
decoder = RNNDecoder(vocab_size=1000, embed_size=256, hidden_size=256)
states = decoder.init_state(batch_size=1)
out, states = decoder.score(label_sequence=[1, 2, 3], states=states)
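Because results are cached per label sequence in score_cache, scoring the same prefix twice reuses the previous output; a small sketch under that assumption:
import torch
from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder

decoder = RNNDecoder(vocab_size=100)
states = decoder.init_state(batch_size=1)

out_first, new_states = decoder.score([1, 2], states)
out_again, _ = decoder.score([1, 2], states)  # same prefix: expected to hit score_cache

print(torch.equal(out_first, out_again))  # True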
select_state(states: Tuple[Tensor, Tensor | None], idx: int) → Tuple[Tensor, Tensor | None]
Get specified ID state from decoder hidden states.
- Parameters:
- states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)
- idx – State ID to extract.
- Returns: Decoder hidden state for given ID. ((N, 1, D_dec), (N, 1, D_dec) or None)
##################### Examples
>>> decoder = RNNDecoder(vocab_size=10)
>>> states = decoder.init_state(batch_size=2)
>>> selected_state = decoder.select_state(states, idx=0)
>>> print(selected_state[0].shape)
torch.Size([1, 1, 256])  # (N, 1, D_dec)
######## NOTE The function assumes that the states are in the expected format.
set_device(device: device) → None
Set the device to use for the RNN decoder.
This method updates the device attribute of the RNNDecoder class, allowing the model to run on the specified device (CPU or GPU).
- Parameters:device – The device ID (torch.device) to be set for the model.
##################### Examples
>>> decoder = RNNDecoder(vocab_size=1000)
>>> decoder.set_device(torch.device('cuda:0'))
######## NOTE The device should be a valid torch.device object, which can be created using torch.device(‘cpu’) or torch.device(‘cuda:0’).
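As described above, set_device only updates the decoder's device attribute, which is used when the decoder creates new tensors (e.g., initial states); the parameters themselves are moved with the usual torch .to() call. A small sketch:
import torch
from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder

decoder = RNNDecoder(vocab_size=1000)

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    decoder.to(device)          # move parameters and buffers
    decoder.set_device(device)  # keep internally created tensors on the same device
else:
    decoder.set_device(torch.device("cpu"))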