espnet2.asr.encoder.rnn_encoder.RNNEncoder
class espnet2.asr.encoder.rnn_encoder.RNNEncoder(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, subsample: Sequence[int] | None = (2, 2, 1, 1))
Bases: AbsEncoder
RNNEncoder class for sequence-to-sequence models using recurrent neural networks.
This class implements an RNN-based encoder for processing sequential data, such as speech signals in automatic speech recognition (ASR) tasks. It allows for the use of either LSTM or GRU cells and supports bidirectional processing and projection layers.
rnn_type
Type of RNN to use (‘lstm’ or ‘gru’).
- Type: str
bidirectional
If True, uses a bidirectional RNN.
- Type: bool
use_projection
If True, applies a projection layer.
- Type: bool
_output_size
The number of output features.
- Type: int
Parameters:
- input_size (int) – The number of expected features in the input.
- rnn_type (str , optional) – Type of RNN (‘lstm’ or ‘gru’). Default is ‘lstm’.
- bidirectional (bool , optional) – Whether to use a bidirectional RNN. Default is True.
- use_projection (bool , optional) – Whether to use a projection layer. Default is True.
- num_layers (int , optional) – Number of recurrent layers. Default is 4.
- hidden_size (int , optional) – Number of hidden features. Default is 320.
- output_size (int , optional) – Number of output features. Default is 320.
- dropout (float , optional) – Dropout probability. Default is 0.0.
- subsample (Sequence[int] , optional) – Subsampling factors for each layer. Default is (2, 2, 1, 1).
Raises: ValueError – If the provided rnn_type is not supported.
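The subsample factors shrink the time dimension layer by layer: keeping every sub-th frame reduces a length of L to roughly ceil(L / sub), and the factors multiply across layers (the default (2, 2, 1, 1) gives an overall factor of 4). The helper below is a hypothetical stand-in (not part of the ESPnet API) that sketches this length arithmetic; ESPnet's exact rounding may differ slightly.

```python
def subsampled_lengths(ilens, subsample=(2, 2, 1, 1)):
    # Hypothetical helper illustrating ESPnet-style slice subsampling:
    # keeping every `sub`-th frame turns a length L into roughly ceil(L / sub).
    out = list(ilens)
    for sub in subsample:
        if sub > 1:
            out = [(length + sub - 1) // sub for length in out]
    return out

# With the default factors (2, 2, 1, 1), lengths shrink by a total factor of 4.
print(subsampled_lengths([100, 80, 64]))  # [25, 20, 16]
```

This is why the lengths returned by forward() can be smaller than the lengths passed in: downstream components (e.g. a CTC head) must use the returned lengths, not the originals.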
######### Examples
Initialize an RNNEncoder:
>>> encoder = RNNEncoder(input_size=40, hidden_size=256, output_size=256)
Forward pass through the encoder:
>>> xs_pad = torch.randn(10, 5, 40)  # (sequence_length, batch_size, input_size)
>>> ilens = torch.tensor([5, 4, 3, 5, 2])  # actual lengths of the sequences
>>> output, lengths, states = encoder(xs_pad, ilens)
####### NOTE This encoder can be easily integrated into larger ASR systems and supports various configurations based on task requirements.
forward(xs_pad: Tensor, ilens: Tensor, prev_states: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor]
Process input sequences through the RNN encoder.
This method takes padded input sequences along with their lengths and optionally previous states, and passes them through the RNN encoder. The output consists of the encoded features, updated input lengths, and the current states of the RNN.
- Parameters:
- xs_pad (torch.Tensor) – Padded input sequences of shape (T, N, C), where T is the maximum sequence length, N is the batch size, and C is the number of input features.
- ilens (torch.Tensor) – Lengths of each sequence in the batch of shape (N,).
- prev_states (torch.Tensor , optional) – Previous states of the RNN. Defaults to None, in which case the states are freshly initialized.
- Returns: A tuple containing:
  - xs_pad (torch.Tensor): The processed padded sequences after passing through the RNN encoder.
  - ilens (torch.Tensor): Updated lengths of each sequence after processing.
  - current_states (torch.Tensor): The updated states of the RNN after processing.
- Return type: Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
######### Examples
>>> encoder = RNNEncoder(input_size=10, output_size=20)
>>> xs_pad = torch.rand(5, 3, 10) # (T=5, N=3, C=10)
>>> ilens = torch.tensor([5, 3, 4]) # Lengths of sequences
>>> output, lengths, states = encoder.forward(xs_pad, ilens)
####### NOTE Ensure that xs_pad is padded correctly and that ilens corresponds to the actual lengths of the sequences in the batch.
- Raises: AssertionError – If the length of prev_states does not match the number of RNN layers.
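Because forward() both accepts prev_states and returns current_states, a long utterance can be encoded chunk by chunk, feeding each call's returned states into the next call. The toy recurrence below (pure Python, not the ESPnet API) sketches that stateful pattern: processing two chunks while carrying the state gives the same output as processing the whole sequence at once.

```python
# Hedged sketch of the prev_states pattern, using a toy scalar recurrence
# in place of actual LSTM/GRU cells.
def toy_encode(chunk, state=None):
    # state=None mimics forward()'s default of freshly initialized states.
    state = 0.0 if state is None else state
    out = []
    for x in chunk:
        state = 0.5 * state + x  # toy recurrence standing in for the RNN cell
        out.append(state)
    return out, state

full, _ = toy_encode([1.0, 2.0, 3.0, 4.0])      # whole sequence at once
a, s = toy_encode([1.0, 2.0])                   # first chunk
b, _ = toy_encode([3.0, 4.0], state=s)          # second chunk, reusing the state
print(full == a + b)  # True: carrying state reproduces the one-shot result
```

With the real encoder the analogous call would pass the returned states back in as prev_states on the next chunk.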
output_size() → int
Return the size of the output features.
This method retrieves the number of output features defined during the initialization of the RNNEncoder. The output size is crucial for subsequent layers in a neural network model, ensuring that the output dimensions match the expected input dimensions of any following layers or components.
- Returns: The number of output features defined in the RNNEncoder.
- Return type: int
######### Examples
>>> encoder = RNNEncoder(input_size=128, output_size=256)
>>> encoder.output_size()
256
####### NOTE This method is particularly useful when you need to understand the dimensions of the output from the RNN layer, especially when designing architectures that require precise input-output shape matching.
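The wiring pattern the note describes can be sketched without an ESPnet install. FakeEncoder below is a hypothetical stand-in that only mirrors the output_size() contract; a downstream component sizes its input dimension from that value rather than hard-coding it.

```python
# Hypothetical stand-in mirroring the encoder's output_size() contract,
# used only to show the shape-matching pattern.
class FakeEncoder:
    def __init__(self, output_size: int):
        self._output_size = output_size

    def output_size(self) -> int:
        # Same contract as RNNEncoder.output_size(): report the feature
        # dimension of the encoder's output.
        return self._output_size

encoder = FakeEncoder(256)
# A downstream decoder or CTC head would take its input dimension from
# output_size() instead of duplicating the number in its own config.
decoder_in_dim = encoder.output_size()
print(decoder_in_dim)  # 256
```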