espnet2.asr_transducer.decoder.rnn_decoder.RNNDecoder
class espnet2.asr_transducer.decoder.rnn_decoder.RNNDecoder(vocab_size: int, embed_size: int = 256, hidden_size: int = 256, rnn_type: str = 'lstm', num_layers: int = 1, dropout_rate: float = 0.0, embed_dropout_rate: float = 0.0, embed_pad: int = 0)
Bases: AbsDecoder
RNN decoder definition for Transducer models.
This class implements an RNN decoder module used in Transducer models. It supports both LSTM and GRU architectures and allows for customization of various parameters such as embedding size, hidden size, and dropout rates.
embed
Embedding layer for the input labels.
- Type: torch.nn.Embedding
dropout_embed
Dropout layer for the embedding.
- Type: torch.nn.Dropout
rnn
List of RNN layers (LSTM/GRU).
- Type: torch.nn.ModuleList
dropout_rnn
List of dropout layers for RNN outputs.
- Type: torch.nn.ModuleList
dlayers
Number of decoder layers.
- Type: int
dtype
Type of RNN used (‘lstm’ or ‘gru’).
- Type: str
output_size
Size of the output from the decoder.
- Type: int
vocab_size
Size of the vocabulary.
- Type: int
device
Device to run the model on (CPU/GPU).
- Type: torch.device
score_cache
Cache for storing scores of previous hypotheses.
- Type: dict
Parameters:
- vocab_size (int) – Vocabulary size.
- embed_size (int , optional) – Embedding size. Default is 256.
- hidden_size (int , optional) – Hidden size. Default is 256.
- rnn_type (str , optional) – Decoder layers type (‘lstm’ or ‘gru’). Default is ‘lstm’.
- num_layers (int , optional) – Number of decoder layers. Default is 1.
- dropout_rate (float , optional) – Dropout rate for decoder layers. Default is 0.0.
- embed_dropout_rate (float , optional) – Dropout rate for embedding layer. Default is 0.0.
- embed_pad (int , optional) – Embedding padding symbol ID. Default is 0.
##################### Examples
Create an RNNDecoder instance
decoder = RNNDecoder(vocab_size=1000, embed_size=256, hidden_size=512)
Forward pass with a batch of label sequences
labels = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.long)
output = decoder(labels)
Initialize decoder states for a single hypothesis
states = decoder.init_state(batch_size=1)
One-step forward hypothesis scoring
out, new_states = decoder.score(label_sequence=[1, 2], states=states)
######## NOTE The decoder supports only ‘lstm’ and ‘gru’ as valid RNN types. Attempting to use any other type will raise a ValueError.
Construct an RNNDecoder object.
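A minimal construction sketch (assuming espnet2 and torch are importable); the GRU variant and the ValueError for an unsupported rnn_type follow the note above.
from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder

# LSTM decoder (default rnn_type), two layers.
lstm_decoder = RNNDecoder(vocab_size=1000, embed_size=256, hidden_size=512, num_layers=2)

# GRU decoder: no cell state, so its hidden states are ((N, B, D_dec), None).
gru_decoder = RNNDecoder(vocab_size=1000, rnn_type="gru")

# Any other rnn_type is rejected with a ValueError.
try:
    RNNDecoder(vocab_size=1000, rnn_type="rnn")
except ValueError as err:
    print(err)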
batch_score(hyps: List[Hypothesis]) → Tuple[Tensor, Tuple[Tensor, Tensor | None]]
One-step forward hypotheses.
This method takes a list of hypotheses and computes the decoder output for each of them in a single step. It uses the last label of each hypothesis to generate the embeddings and feeds them through the RNN together with the batched hypothesis states.
- Parameters:hyps – A list of Hypothesis objects, each containing a sequence of label IDs and the corresponding decoder states.
- Returns:
- out: Decoder output sequences of shape (B, D_dec), where B is the batch size and D_dec is the decoder output dimension.
- states: Decoder hidden states, a tuple whose first element (hidden states) has shape (N, B, D_dec) and whose second element (cell states, present only when using LSTM) also has shape (N, B, D_dec).
##################### Examples
>>> from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder
>>> from espnet2.asr_transducer.beam_search_transducer import Hypothesis
>>> decoder = RNNDecoder(vocab_size=100, embed_size=64, hidden_size=128)
>>> state = decoder.select_state(decoder.init_state(batch_size=1), idx=0)
>>> hyps = [Hypothesis(score=0.0, yseq=[1, 2, 3], dec_state=state),
...         Hypothesis(score=0.0, yseq=[4, 5, 6], dec_state=state)]
>>> out, states = decoder.batch_score(hyps)
>>> print(out.shape)  # torch.Size([2, 128])
create_batch_states(new_states: List[Tuple[Tensor, Tensor | None]]) → Tuple[Tensor, Tensor | None]
Create decoder hidden states.
- Parameters:new_states – List of per-hypothesis decoder hidden states, where each element is a tuple structured as:
- For LSTM: ((N, 1, D_dec), (N, 1, D_dec))
- For GRU: ((N, 1, D_dec), None)
- Returns: Combined decoder hidden states. The output is a tuple structured as:
- For LSTM: ((N, B, D_dec), (N, B, D_dec))
- For GRU: ((N, B, D_dec), None)
- Return type: states
##################### Examples
>>> decoder = RNNDecoder(vocab_size=100, hidden_size=256, num_layers=2)
>>> new_states = [(torch.zeros(2, 1, 256), torch.zeros(2, 1, 256)),
...               (torch.zeros(2, 1, 256), torch.zeros(2, 1, 256))]
>>> batch_states = decoder.create_batch_states(new_states)
>>> print(batch_states[0].shape)  # torch.Size([2, 2, 256])
>>> print(batch_states[1].shape)  # torch.Size([2, 2, 256])
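create_batch_states is the counterpart of select_state: batched states produced by init_state (or returned by batch_score) can be split per hypothesis and reassembled. A short sketch using only the methods documented on this page:
from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder

decoder = RNNDecoder(vocab_size=100, hidden_size=256, num_layers=2)

# Batched LSTM states for 3 hypotheses: ((N, B, D_dec), (N, B, D_dec)).
batched = decoder.init_state(batch_size=3)

# Split into per-hypothesis states ((N, 1, D_dec), (N, 1, D_dec)) ...
per_hyp = [decoder.select_state(batched, idx=i) for i in range(3)]

# ... and reassemble them into a batch again.
rebuilt = decoder.create_batch_states(per_hyp)
print(rebuilt[0].shape)  # torch.Size([2, 3, 256])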
forward(labels: Tensor) → Tensor
Encode source label sequences.

This method embeds the input label sequences, applies embedding dropout, and passes the result through the stacked LSTM/GRU layers to produce the decoder output sequences.

- Parameters:labels (torch.Tensor) – Label ID sequences of shape (B, L).
##################### Examples
Initialize the RNNDecoder
decoder = RNNDecoder(vocab_size=1000, embed_size=256, hidden_size=256)
Forward pass with labels
labels = torch.randint(0, 1000, (32, 10))  # (B, L)
output = decoder.forward(labels)  # (B, U, D_dec)
- Returns: Decoder output sequences of shape (B, U, D_dec).
- Return type: out (torch.Tensor)
- Raises:ValueError – If the specified rnn_type is not supported (not ‘lstm’ or ‘gru’).
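For context, the forward output is the prediction-network side of a Transducer: a joint network combines it with the encoder output over every (T, U) pair. The sketch below is only an illustration using plain torch projections and random encoder features, not ESPnet's actual joint network module.
import torch
from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder

B, T, U, D_enc, vocab = 4, 50, 10, 256, 1000
decoder = RNNDecoder(vocab_size=vocab, embed_size=256, hidden_size=256)

labels = torch.randint(1, vocab, (B, U))   # (B, L) label sequences
dec_out = decoder(labels)                  # (B, U, D_dec)
enc_out = torch.randn(B, T, D_enc)         # placeholder encoder output

# Illustrative joint: project both sides to a common size, broadcast-add
# over the (T, U) lattice, then map to vocabulary logits.
joint_dim = 256
lin_enc = torch.nn.Linear(D_enc, joint_dim)
lin_dec = torch.nn.Linear(decoder.output_size, joint_dim)
lin_out = torch.nn.Linear(joint_dim, vocab)

lattice = torch.tanh(lin_enc(enc_out).unsqueeze(2) + lin_dec(dec_out).unsqueeze(1))
logits = lin_out(lattice)                  # (B, T, U, vocab)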
init_state(batch_size: int) → Tuple[Tensor, Tensor | None]
Initialize decoder states.

This method creates zero-valued initial hidden states (and cell states when the decoder uses LSTM layers) for the given batch size, allocated on the decoder's current device.

- Parameters:batch_size (int) – Batch size.
- Returns: Initial decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)
##################### Examples
>>> decoder = RNNDecoder(vocab_size=5000, embed_size=256, hidden_size=256)
>>> states = decoder.init_state(batch_size=32)
>>> states[0].shape
torch.Size([1, 32, 256])  # (N, B, D_dec) with num_layers=1
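A quick check of the state structure for both supported RNN types; shapes assume the default num_layers=1 and hidden_size=256.
from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder

lstm_dec = RNNDecoder(vocab_size=100, rnn_type="lstm")
h, c = lstm_dec.init_state(batch_size=4)
print(h.shape, c.shape)  # torch.Size([1, 4, 256]) torch.Size([1, 4, 256])

gru_dec = RNNDecoder(vocab_size=100, rnn_type="gru")
h, c = gru_dec.init_state(batch_size=4)
print(h.shape, c)        # torch.Size([1, 4, 256]) None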
rnn_forward(x: Tensor, state: Tuple[Tensor, Tensor | None]) → Tuple[Tensor, Tuple[Tensor, Tensor | None]]
Encode embedded label sequences through the decoder RNN layers.

This method feeds already-embedded inputs through each LSTM/GRU layer in turn, applying dropout to the output of every layer, and returns the final layer outputs together with the updated hidden (and, for LSTM, cell) states. It is used internally by forward, score, and batch_score.

- Parameters:
- x – Embedded input sequences.
- state – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)
- Returns:
- out: RNN output sequences.
- states: Updated decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)
######## NOTE The decoder supports both LSTM and GRU architectures; the choice should be made based on the requirements of the task.
score(label_sequence: List[int], states: Tuple[Tensor, Tensor | None]) → Tuple[Tensor, Tuple[Tensor, Tensor | None]]
One-step forward hypothesis.

This method embeds the last label of the current label sequence, performs a single decoder step from the provided hidden states, and returns the decoder output together with the updated states. Results are stored in score_cache, keyed by the label sequence, so that an already-scored prefix is not recomputed.

- Parameters:
- label_sequence (List[int]) – Current label ID sequence.
- states – Decoder hidden states. ((N, 1, D_dec), (N, 1, D_dec) or None)
- Returns:
- out: Decoder output sequence. (1, D_dec)
- states: Decoder hidden states. ((N, 1, D_dec), (N, 1, D_dec) or None)
##################### Examples
decoder = RNNDecoder(vocab_size=1000, embed_size=256, hidden_size=256)
states = decoder.init_state(batch_size=1)
out, states = decoder.score(label_sequence=[1, 2, 3], states=states)
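Because results are cached per label sequence in score_cache, scoring the same prefix twice reuses the previous output; a small sketch under that assumption:
import torch
from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder

decoder = RNNDecoder(vocab_size=100)
states = decoder.init_state(batch_size=1)

out_first, new_states = decoder.score([1, 2], states)
out_again, _ = decoder.score([1, 2], states)  # same prefix: expected to hit score_cache

print(torch.equal(out_first, out_again))  # True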
select_state(states: Tuple[Tensor, Tensor | None], idx: int) → Tuple[Tensor, Tensor | None]
Get specified ID state from decoder hidden states.
- Parameters:
- states – Decoder hidden states. ((N, B, D_dec), (N, B, D_dec) or None)
- idx – State ID to extract.
- Returns: Decoder hidden state for given ID. ((N, 1, D_dec), (N, 1, D_dec) or None)
##################### Examples
>>> decoder = RNNDecoder(vocab_size=10)
>>> states = decoder.init_state(batch_size=2)
>>> selected_state = decoder.select_state(states, idx=0)
>>> print(selected_state[0].shape)
torch.Size([1, 1, 256])  # (N, 1, D_dec)
######## NOTE The function assumes that the states are in the expected format.
set_device(device: device) → None
Set the device to use for the RNN decoder.
This method updates the device attribute of the RNNDecoder class, allowing the model to run on the specified device (CPU or GPU).
- Parameters:device – The device ID (torch.device) to be set for the model.
##################### Examples
>>> decoder = RNNDecoder(vocab_size=1000)
>>> decoder.set_device(torch.device('cuda:0'))
######## NOTE The device should be a valid torch.device object, which can be created using torch.device(‘cpu’) or torch.device(‘cuda:0’).
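As described above, set_device only updates the decoder's device attribute, which is used when the decoder creates new tensors (e.g., initial states); the parameters themselves are moved with the usual torch .to() call. A small sketch:
import torch
from espnet2.asr_transducer.decoder.rnn_decoder import RNNDecoder

decoder = RNNDecoder(vocab_size=1000)

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    decoder.to(device)          # move parameters and buffers
    decoder.set_device(device)  # keep internally created tensors on the same device
else:
    decoder.set_device(torch.device("cpu"))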