espnet2.asr.encoder.vgg_rnn_encoder.VGGRNNEncoder
class espnet2.asr.encoder.vgg_rnn_encoder.VGGRNNEncoder(input_size: int, rnn_type: str = 'lstm', bidirectional: bool = True, use_projection: bool = True, num_layers: int = 4, hidden_size: int = 320, output_size: int = 320, dropout: float = 0.0, in_channel: int = 1)
Bases: AbsEncoder
VGGRNNEncoder class for sequence-to-sequence modeling using VGG and RNN.
This encoder combines a VGG-style convolutional front end with a recurrent neural network (RNN). It processes padded sequences of input features and produces a time-subsampled sequence of output features, which can be used for tasks such as automatic speech recognition (ASR).
output_size
The number of output features from the encoder.
- Type: int
rnn_type
Type of RNN used; either ‘lstm’ or ‘gru’.
- Type: str
bidirectional
If True, the RNN will be bidirectional.
- Type: bool
use_projection
If True, a projection layer will be used.
- Type: bool
Parameters:
- input_size (int) – The number of expected features in the input.
- rnn_type (str , optional) – The type of RNN to use (‘lstm’ or ‘gru’). Defaults to ‘lstm’.
- bidirectional (bool , optional) – If True, the RNN will be bidirectional. Defaults to True.
- use_projection (bool , optional) – Whether to use a projection layer. Defaults to True.
- num_layers (int , optional) – Number of recurrent layers. Defaults to 4.
- hidden_size (int , optional) – The number of hidden features. Defaults to 320.
- output_size (int , optional) – The number of output features. Defaults to 320.
- dropout (float , optional) – Dropout probability. Defaults to 0.0.
- in_channel (int , optional) – Number of input channels. Defaults to 1.
Raises: ValueError – If an unsupported RNN type is specified.
######### Examples
import torch
encoder = VGGRNNEncoder(input_size=80)
xs_pad = torch.randn(10, 32, 80)  # (batch_size, sequence_length, features)
ilens = torch.tensor([32] * 10)  # one length per utterance in the batch
output, lengths, states = encoder(xs_pad, ilens)
####### NOTE
The input features should be padded and properly masked before passing them to the forward method.
forward(xs_pad: Tensor, ilens: Tensor, prev_states: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor]
Processes input tensors through the VGGRNNEncoder.
This method takes padded input sequences and their corresponding lengths, along with previous states, and passes them through the encoder modules. It returns the processed output, updated input lengths, and the current states of the RNN.
- Parameters:
- xs_pad (torch.Tensor) – Padded input tensor of shape (B, T, D), where B is the batch size, T is the maximum sequence length, and D is the input feature dimension.
- ilens (torch.Tensor) – Tensor of shape (B,) containing the actual length of each sequence in the batch.
- prev_states (torch.Tensor , optional) – Previous states of the RNN, default is None. If None, initializes states to None for each encoder module.
- Returns: A tuple containing:
  - The output tensor after processing, of shape (B, T', output_size), where T' is the sequence length after VGG subsampling.
  - The updated (subsampled) lengths tensor of shape (B,).
  - A list of current states for each encoder module.
- Return type: Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
######### Examples
>>> encoder = VGGRNNEncoder(input_size=40)
>>> xs_pad = torch.randn(32, 100, 40)  # (B, T, D)
>>> ilens = torch.tensor([100] * 32)  # (B,)
>>> output, updated_ilens, states = encoder(xs_pad, ilens)
####### NOTE
The output tensor has output_size features, and its time axis is subsampled by the VGG front end, so the returned lengths are smaller than ilens.
- Raises: AssertionError – If the length of prev_states does not match the number of encoder modules.
output_size() → int
Get the output size of the encoder.
This method returns the number of output features produced by the encoder, as set when the VGGRNNEncoder was initialized. Downstream layers use this value to size their inputs.
- Returns: The number of output features.
- Return type: int
######### Examples
encoder = VGGRNNEncoder(input_size=128, output_size=256)
output_size = encoder.output_size()
print(output_size)  # Output: 256
####### NOTE
The output size is defined during the instantiation of the VGGRNNEncoder class and can be used to ensure compatibility with subsequent layers.