espnet2.asr_transducer.encoder.blocks.conformer.Conformer
class espnet2.asr_transducer.encoder.blocks.conformer.Conformer(block_size: int, self_att: torch.nn.Module, feed_forward: torch.nn.Module, feed_forward_macaron: torch.nn.Module, conv_mod: torch.nn.Module, norm_class: torch.nn.Module = torch.nn.LayerNorm, norm_args: Dict = {}, dropout_rate: float = 0.0)
Bases: Module
Conformer block for Transducer encoder.
This module implements a single Conformer block: a macaron-style feed-forward module, a self-attention module, a convolution module, and a second feed-forward module, each combined with normalization, dropout, and a residual connection, followed by a final normalization. It is designed to be used within an automatic speech recognition (ASR) Transducer encoder.
self_att
Self-attention module instance.
- Type: torch.nn.Module
feed_forward
Feed-forward module instance.
- Type: torch.nn.Module
feed_forward_macaron
Feed-forward module instance for macaron network.
- Type: torch.nn.Module
conv_mod
Convolution module instance.
- Type: torch.nn.Module
norm_feed_forward
Normalization module for the feed-forward component.
- Type: torch.nn.Module
norm_self_att
Normalization module for the self-attention component.
- Type: torch.nn.Module
norm_macaron
Normalization module for the macaron feed-forward component.
- Type: torch.nn.Module
norm_conv
Normalization module for the convolution component.
- Type: torch.nn.Module
norm_final
Final normalization module.
- Type: torch.nn.Module
dropout
Dropout layer for regularization.
- Type: torch.nn.Dropout
block_size
Input/output size of the block.
- Type: int
cache
Caches for self-attention and convolution modules during streaming.
- Type: Optional[List[torch.Tensor]]
Parameters:
- block_size (int) – Input/output size.
- self_att (torch.nn.Module) – Self-attention module instance.
- feed_forward (torch.nn.Module) – Feed-forward module instance.
- feed_forward_macaron (torch.nn.Module) – Feed-forward module instance for macaron network.
- conv_mod (torch.nn.Module) – Convolution module instance.
- norm_class (torch.nn.Module, optional) – Normalization module class. Defaults to torch.nn.LayerNorm.
- norm_args (Dict, optional) – Normalization module arguments. Defaults to {}.
- dropout_rate (float, optional) – Dropout rate. Defaults to 0.0.
reset_streaming_cache(left_context: int, device: torch.device) -> None: Initialize/Reset self-attention and convolution modules cache for streaming.
forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: Encode input sequences.
chunk_forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, left_context: int = 0) -> Tuple[torch.Tensor, torch.Tensor]: Encode chunk of input sequence.
########### Examples
Example of creating a Conformer block (the sub-module instances shown are placeholders):

import torch
from espnet2.asr_transducer.encoder.blocks.conformer import Conformer

conformer_block = Conformer(
    block_size=256,
    self_att=SelfAttentionModule(),
    feed_forward=FeedForwardModule(),
    feed_forward_macaron=FeedForwardModule(),
    conv_mod=ConvolutionModule(),
    norm_class=torch.nn.LayerNorm,
    norm_args={'eps': 1e-5},
    dropout_rate=0.1,
)

Example of using the forward method:

output, mask, pos_enc = conformer_block.forward(
    x=torch.randn(10, 20, 256),        # Batch size 10, sequence length 20, feature size 256
    pos_enc=torch.randn(10, 38, 256),  # Positional encoding
    mask=torch.ones(10, 20),           # Source mask
)

Example of using the reset_streaming_cache method:

conformer_block.reset_streaming_cache(left_context=5, device=torch.device('cuda'))

Example of using the chunk_forward method:

chunk_output, updated_pos_enc = conformer_block.chunk_forward(
    x=torch.randn(10, 15, 256),        # Batch size 10, sequence length 15, feature size 256
    pos_enc=torch.randn(10, 30, 256),  # Positional encoding
    mask=torch.ones(10, 15),           # Source mask
    left_context=5,
)
Construct a Conformer object.
chunk_forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, left_context: int = 0) -> Tuple[torch.Tensor, torch.Tensor]
Encode chunk of input sequence.
This method processes a chunk of the input sequence, utilizing the self-attention mechanism while considering a specified number of previous frames as context. It updates the internal cache for streaming purposes, allowing the model to maintain state across chunks.
- Parameters:
- x – Conformer input sequences. Shape: (B, T, D_block)
- pos_enc – Positional embedding sequences. Shape: (B, 2 * (T - 1), D_block)
- mask – Source mask. Shape: (B, T_2)
- left_context – Number of previous frames the attention module can see in the current chunk. Default is 0.
- Returns:
  - x: Conformer output sequences. Shape: (B, T, D_block)
  - pos_enc: Positional embedding sequences. Shape: (B, 2 * (T - 1), D_block)
- Return type: Tuple[torch.Tensor, torch.Tensor]
########### Examples
>>> conformer = Conformer(block_size=128, self_att=..., feed_forward=...,
... feed_forward_macaron=..., conv_mod=..., norm_class=torch.nn.LayerNorm)
>>> x = torch.randn(10, 20, 128) # Batch size 10, sequence length 20
>>> pos_enc = torch.randn(10, 38, 128) # Positional encodings
>>> mask = torch.ones(10, 20) # Source mask
>>> output, updated_pos_enc = conformer.chunk_forward(x, pos_enc, mask,
... left_context=5)
####### NOTE
The method modifies the internal cache, which is used to retain information from previous chunks, enhancing the performance in streaming scenarios.
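For illustration only, the sketch below shows how reset_streaming_cache and chunk_forward fit together in a chunk-by-chunk decoding loop; it reuses the conformer instance from the example above, and the chunk size, batch size, and positional embedding values are placeholder assumptions rather than values mandated by this API.

>>> left_context = 5
>>> conformer.reset_streaming_cache(left_context=left_context, device=torch.device('cpu'))
>>> chunks = torch.randn(4, 1, 16, 128)   # four placeholder chunks of 16 frames, batch size 1
>>> pos_enc = torch.randn(1, 30, 128)     # 2 * (16 - 1) positions, per the documented shape
>>> mask = torch.ones(1, 16)
>>> for chunk in chunks:
...     # Each call reads the cache written by the previous chunk and updates it.
...     out, pos_enc = conformer.chunk_forward(chunk, pos_enc, mask,
...                                            left_context=left_context)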
forward(x: torch.Tensor, pos_enc: torch.Tensor, mask: torch.Tensor, chunk_mask: Optional[torch.Tensor] = None) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
Encode input sequences through the Conformer module.
This method processes the input sequences using self-attention, convolution, and feed-forward layers, applying normalization and dropout as specified.
- Parameters:
- x (torch.Tensor) – Conformer input sequences of shape (B, T, D_block).
- pos_enc (torch.Tensor) – Positional embedding sequences of shape (B, 2 * (T - 1), D_block).
- mask (torch.Tensor) – Source mask of shape (B, T).
- chunk_mask (Optional[torch.Tensor]) – Optional chunk mask of shape (T_2, T_2).
- Returns:
- x (torch.Tensor): Conformer output sequences of shape (B, T, D_block).
- mask (torch.Tensor): Source mask of shape (B, T).
- pos_enc (torch.Tensor): Positional embedding sequences of shape (B, 2 * (T - 1), D_block).
- Return type: Tuple[torch.Tensor, torch.Tensor, torch.Tensor]
########### Examples
>>> model = Conformer(block_size=128,
... self_att=self_attention_module,
... feed_forward=feed_forward_module,
... feed_forward_macaron=feed_forward_macaron_module,
... conv_mod=conv_module)
>>> input_sequences = torch.randn(32, 10, 128)
>>> positional_encodings = torch.randn(32, 18, 128)
>>> source_mask = torch.ones(32, 10)
>>> output, mask, pos_enc = model(input_sequences, positional_encodings,
... source_mask)
####### NOTE
This method is designed to be used within the context of the Conformer architecture and requires the input tensors to be appropriately shaped and normalized before being passed in.
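As a quick shape sanity check (a sketch restating the documented shapes, not additional API behaviour), the positional embedding length can be derived from the sequence length before calling forward; model is the instance built in the example above:

>>> B, T, D_block = 32, 10, 128
>>> x = torch.randn(B, T, D_block)
>>> pos_enc = torch.randn(B, 2 * (T - 1), D_block)   # length 18, matching the example above
>>> mask = torch.ones(B, T)
>>> output, mask, pos_enc = model(x, pos_enc, mask)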
reset_streaming_cache(left_context: int, device: torch.device) -> None
Initialize or reset the streaming cache for self-attention and convolution modules.
This method sets up the cache used by the self-attention and convolution layers to facilitate streaming processing of input sequences. The cache is initialized based on the specified left_context, which determines how many previous frames the attention module can access for the current chunk of input data.
- Parameters:
- left_context – An integer representing the number of previous frames that the attention module can see in the current chunk.
- device – The device (CPU or GPU) to use for creating the cache tensors.
########### Examples
>>> conformer = Conformer(...)
>>> conformer.reset_streaming_cache(left_context=5, device=torch.device('cuda'))
####### NOTE
The cache is a list containing two tensors: the first tensor holds the self-attention cache, and the second tensor holds the convolution cache. The shape of the tensors is determined by the block_size and the convolution module’s kernel size.
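As a rough illustration of that description (the batch dimension, kernel size, and exact tensor layout are assumptions made for this sketch, not guaranteed by the API), the cache can be pictured as:

>>> left_context, block_size, kernel_size = 5, 128, 31   # kernel_size is an assumed value
>>> cache = [
...     torch.zeros(1, left_context, block_size),         # self-attention cache
...     torch.zeros(1, block_size, kernel_size - 1),      # convolution cache
... ]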