espnet2.asr_transducer.encoder.blocks.conformer.Conformer
class espnet2.asr_transducer.encoder.blocks.conformer.Conformer(block_size: int, self_att: torch.nn.Module, feed_forward: torch.nn.Module, feed_forward_macaron: torch.nn.Module, conv_mod: torch.nn.Module, norm_class: torch.nn.Module = torch.nn.LayerNorm, norm_args: Dict = {}, dropout_rate: float = 0.0)
Bases: Module
Conformer block for Transducer encoder.
This module implements a single Conformer block: a macaron-style feed-forward module, a self-attention module, a convolution module, and a second feed-forward module, each combined with normalization, dropout, and a residual connection, followed by a final normalization. It is designed to be used within an automatic speech recognition (ASR) Transducer encoder.
self_att
Self-attention module instance.
- Type: torch.nn.Module
feed_forward
Feed-forward module instance.
- Type: torch.nn.Module
feed_forward_macaron
Feed-forward module instance for macaron network.
- Type: torch.nn.Module
conv_mod
Convolution module instance.
- Type: torch.nn.Module
norm_feed_forward
Normalization module for the feed-forward component.
- Type: torch.nn.Module
norm_self_att
Normalization module for the self-attention component.
- Type: torch.nn.Module
norm_macaron
Normalization module for the macaron feed-forward component.
- Type: torch.nn.Module
norm_conv
Normalization module for the convolution component.
- Type: torch.nn.Module
norm_final
Final normalization module.
- Type: torch.nn.Module
dropout
Dropout layer for regularization.
- Type: torch.nn.Dropout
block_size
Input/output size of the block.
- Type: int
cache
Caches for self-attention and convolution modules during streaming.
- Type: Optional[List[torch.Tensor]]
Parameters:
- block_size (int) – Input/output size.
- self_att (torch.nn.Module) – Self-attention module instance.
- feed_forward (torch.nn.Module) – Feed-forward module instance.
- feed_forward_macaron (torch.nn.Module) – Feed-forward module instance for macaron network.
- conv_mod (torch.nn.Module) – Convolution module instance.
- norm_class (torch.nn.Module, optional) – Normalization module class. Defaults to torch.nn.LayerNorm.
- norm_args (Dict, optional) – Normalization module arguments. Defaults to {}.
- dropout_rate (float, optional) – Dropout rate. Defaults to 0.0.
reset_streaming_cache(left_context: int, device: torch.device) -> None: Initialize/Reset self-attention and convolution modules cache for streaming.
forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: Encode input sequences.
chunk_forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, left_context: int = 0) -> Tuple[torch.Tensor, torch.Tensor]: Encode chunk of input sequence.
########### Examples
Example of creating a Conformer block (the sub-module instances shown are placeholders):

import torch
from espnet2.asr_transducer.encoder.blocks.conformer import Conformer

conformer_block = Conformer(
    block_size=256,
    self_att=SelfAttentionModule(),
    feed_forward=FeedForwardModule(),
    feed_forward_macaron=FeedForwardModule(),
    conv_mod=ConvolutionModule(),
    norm_class=torch.nn.LayerNorm,
    norm_args={'eps': 1e-5},
    dropout_rate=0.1,
)

Example of using the forward method:

output, mask, pos_enc = conformer_block.forward(
    x=torch.randn(10, 20, 256),        # Batch size 10, sequence length 20, feature size 256
    pos_enc=torch.randn(10, 38, 256),  # Positional encoding
    mask=torch.ones(10, 20),           # Source mask
)

Example of using the reset_streaming_cache method:

conformer_block.reset_streaming_cache(left_context=5, device=torch.device('cuda'))

Example of using the chunk_forward method:

chunk_output, updated_pos_enc = conformer_block.chunk_forward(
    x=torch.randn(10, 15, 256),        # Batch size 10, sequence length 15, feature size 256
    pos_enc=torch.randn(10, 30, 256),  # Positional encoding
    mask=torch.ones(10, 15),           # Source mask
    left_context=5,
)
Construct a Conformer object.
chunk_forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, left_context: int = 0) -> Tuple[torch.Tensor, torch.Tensor]
Encode chunk of input sequence.
This method processes a chunk of the input sequence, utilizing the self-attention mechanism while considering a specified number of previous frames as context. It updates the internal cache for streaming purposes, allowing the model to maintain state across chunks.
- Parameters:
- x – Conformer input sequences. Shape: (B, T, D_block)
- pos_enc – Positional embedding sequences. Shape: (B, 2 * (T - 1), D_block)
- mask – Source mask. Shape: (B, T_2)
- left_context – Number of previous frames the attention module can see in the current chunk. Default is 0.
- Returns:
  - x: Conformer output sequences. Shape: (B, T, D_block)
  - pos_enc: Positional embedding sequences. Shape: (B, 2 * (T - 1), D_block)
- Return type: Tuple[torch.Tensor, torch.Tensor]
########### Examples
>>> conformer = Conformer(block_size=128, self_att=..., feed_forward=...,
... feed_forward_macaron=..., conv_mod=..., norm_class=torch.nn.LayerNorm)
>>> x = torch.randn(10, 20, 128) # Batch size 10, sequence length 20
>>> pos_enc = torch.randn(10, 38, 128) # Positional encodings
>>> mask = torch.ones(10, 20) # Source mask
>>> output, updated_pos_enc = conformer.chunk_forward(x, pos_enc, mask,
... left_context=5)
####### NOTE
The method modifies the internal cache, which is used to retain information from previous chunks, enhancing the performance in streaming scenarios.
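For illustration only, the sketch below shows how reset_streaming_cache and chunk_forward fit together in a chunk-by-chunk decoding loop; it reuses the conformer instance from the example above, and the chunk size, batch size, and positional embedding values are placeholder assumptions rather than values mandated by this API.

>>> left_context = 5
>>> conformer.reset_streaming_cache(left_context=left_context, device=torch.device('cpu'))
>>> chunks = torch.randn(4, 1, 16, 128)   # four placeholder chunks of 16 frames, batch size 1
>>> pos_enc = torch.randn(1, 30, 128)     # 2 * (16 - 1) positions, per the documented shape
>>> mask = torch.ones(1, 16)
>>> for chunk in chunks:
...     # Each call reads the cache written by the previous chunk and updates it.
...     out, pos_enc = conformer.chunk_forward(chunk, pos_enc, mask,
...                                            left_context=left_context)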
forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Encode input sequences through the Conformer module.
This method processes the input sequences using self-attention, convolution, and feed-forward layers, applying normalization and dropout as specified.
- Parameters:
- x (torch.Tensor) – Conformer input sequences of shape (B, T, D_block).
- pos_enc (torch.Tensor) – Positional embedding sequences of shape (B, 2 * (T - 1), D_block).
- mask (torch.Tensor) – Source mask of shape (B, T).
- chunk_mask (Optional[torch.Tensor]) – Optional chunk mask of shape (T_2, T_2).
- Returns:
- x (torch.Tensor): Conformer output sequences of shape (B, T, D_block).
- mask (torch.Tensor): Source mask of shape (B, T).
- pos_enc (torch.Tensor): Positional embedding sequences of shape (B, 2 * (T - 1), D_block).
- Return type: Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
########### Examples
>>> model = Conformer(block_size=128,
... self_att=self_attention_module,
... feed_forward=feed_forward_module,
... feed_forward_macaron=feed_forward_macaron_module,
... conv_mod=conv_module)
>>> input_sequences = torch.randn(32, 10, 128)
>>> positional_encodings = torch.randn(32, 18, 128)
>>> source_mask = torch.ones(32, 10)
>>> output, mask, pos_enc = model(input_sequences, positional_encodings,
... source_mask)
####### NOTE
This method is designed to be used within the context of the Conformer architecture and requires the input tensors to be appropriately shaped and normalized before being passed in.
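As a quick shape sanity check (a sketch restating the documented shapes, not additional API behaviour), the positional embedding length can be derived from the sequence length before calling forward; model is the instance built in the example above:

>>> B, T, D_block = 32, 10, 128
>>> x = torch.randn(B, T, D_block)
>>> pos_enc = torch.randn(B, 2 * (T - 1), D_block)   # length 18, matching the example above
>>> mask = torch.ones(B, T)
>>> output, mask, pos_enc = model(x, pos_enc, mask)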
reset_streaming_cache(left_context: int, device: torch.device) -> None
Initialize or reset the streaming cache for self-attention and convolution modules.
This method sets up the cache used by the self-attention and convolution layers to facilitate streaming processing of input sequences. The cache is initialized based on the specified left_context, which determines how many previous frames the attention module can access for the current chunk of input data.
- Parameters:
- left_context – An integer representing the number of previous frames that the attention module can see in the current chunk.
- device – The device (CPU or GPU) to use for creating the cache tensors.
########### Examples
>>> conformer = Conformer(...)
>>> conformer.reset_streaming_cache(left_context=5, device=torch.device('cuda'))
####### NOTE
The cache is a list containing two tensors: the first tensor holds the self-attention cache, and the second tensor holds the convolution cache. The shape of the tensors is determined by the block_size and the convolution module’s kernel size.
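As a rough illustration of that description (the batch dimension, kernel size, and exact tensor layout are assumptions made for this sketch, not guaranteed by the API), the cache can be pictured as:

>>> left_context, block_size, kernel_size = 5, 128, 31   # kernel_size is an assumed value
>>> cache = [
...     torch.zeros(1, left_context, block_size),         # self-attention cache
...     torch.zeros(1, block_size, kernel_size - 1),      # convolution cache
... ]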