espnet2.gan_codec.shared.encoder.seanet_2d.SEANetEncoder2d

class espnet2.gan_codec.shared.encoder.seanet_2d.SEANetEncoder2d(channels: int = 1, dimension: int = 128, n_filters: int = 32, n_residual_layers: int = 1, ratios: List[Tuple[int, int]] = [(4, 1), (4, 1), (4, 2), (4, 1)], activation: str = 'ELU', activation_params: dict = {'alpha': 1.0}, norm: str = 'weight_norm', norm_params: Dict[str, Any] = {}, kernel_size: int = 7, last_kernel_size: int = 7, residual_kernel_size: int = 3, dilation_base: int = 2, causal: bool = False, pad_mode: str = 'reflect', true_skip: bool = False, compress: int = 2, lstm: int = 2, res_seq=True, conv_group_ratio: int = -1)

Bases: Module

SEANet encoder for audio signal processing.

This class implements the SEANet encoder architecture, which is designed to process audio signals through a series of convolutional layers, residual blocks, and optional LSTM layers. The encoder reduces the dimensionality of the input while preserving important features for subsequent decoding.

Parameters:
- channels (int) – Audio channels (default: 1).
- dimension (int) – Intermediate representation dimension (default: 128).
- n_filters (int) – Base width for the model (default: 32).
- n_residual_layers (int) – Number of residual layers (default: 1).
- ratios (List[Tuple[int, int]]) – Kernel size and stride ratios for downsampling (default: [(4, 1), (4, 1), (4, 2), (4, 1)]).
- activation (str) – Activation function (default: "ELU").
- activation_params (dict) – Parameters for the activation function (default: {"alpha": 1.0}).
- norm (str) – Normalization method (default: "weight_norm").
- norm_params (Dict[str, Any]) – Parameters for the normalization method (default: {}).
- kernel_size (int) – Kernel size for the initial convolution (default: 7).
- last_kernel_size (int) – Kernel size for the last convolution (default: 7).
- residual_kernel_size (int) – Kernel size for the residual layers (default: 3).
- dilation_base (int) – Increase factor for dilation with each layer (default: 2).
- causal (bool) – Whether to use fully causal convolution (default: False).
- pad_mode (str) – Padding mode for convolutions (default: "reflect").
- true_skip (bool) – Use true skip connection or simple convolution for skip connection (default: False).
- compress (int) – Reduced dimensionality in residual branches (default: 2).
- lstm (int) – Number of LSTM layers at the end of the encoder (default: 2).
- res_seq (bool) – Whether to use a residual sequence (default: True).
- conv_group_ratio (int) – Ratio for grouping convolutions (default: -1).
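To make "base width" concrete, the sketch below lists per-stage channel counts under the assumption, common in SEANet-style encoders but not confirmed by this page, that the channel count starts at `n_filters` and doubles after each downsampling stage. The helper `stage_widths` is hypothetical and not part of ESPnet.

```python
from typing import List, Tuple


def stage_widths(n_filters: int, ratios: List[Tuple[int, int]]) -> List[int]:
    """Channel count before each downsampling stage, plus the final width.

    Assumption (not taken from the ESPnet source): the width doubles once
    per entry in ``ratios``.
    """
    return [n_filters * 2 ** i for i in range(len(ratios) + 1)]


# Documented defaults for n_filters and ratios.
default_ratios = [(4, 1), (4, 1), (4, 2), (4, 1)]
widths = stage_widths(32, default_ratios)
print(widths)  # [32, 64, 128, 256, 512]
```

Under that assumption, the last width (512 with the defaults) would be what the final convolution projects down to `dimension`.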

Examples:

>>> import torch
>>> from espnet2.gan_codec.shared.encoder.seanet_2d import SEANetEncoder2d
>>> encoder = SEANetEncoder2d(channels=2, dimension=256)
>>> input_tensor = torch.randn(1, 2, 16000)  # (batch_size, channels, time)
>>> output = encoder(input_tensor)
>>> output.shape
torch.Size([1, 256, <time_dim>])  # Time dimension varies based on input

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(x)

SEANet encoder.

This class implements the SEANet encoder architecture for audio processing, utilizing a series of convolutional layers, residual blocks, and optional LSTM layers to produce an intermediate representation of audio inputs.

channels

Number of audio channels.

Type: int

dimension

Intermediate representation dimension.

Type: int

n_filters

Base width for the model.

Type: int

n_residual_layers

Number of residual layers.

Type: int

ratios

Kernel size and stride ratios for downsampling.

Type: List[Tuple[int, int]]

hop_length

The total hop length calculated from the ratios.

Type: int
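As a rough illustration of how a total hop length can be derived from `ratios`, the sketch below multiplies the stride entries together. It assumes that the second element of each pair is the stride and that `hop_length` is the product of those strides. Neither detail is confirmed by this page, and the helper `total_hop_length` is hypothetical.

```python
import math
from typing import List, Tuple


def total_hop_length(ratios: List[Tuple[int, int]]) -> int:
    """Product of the stride entries of ``ratios``.

    Assumption: ratio[1] is the stride of each downsampling stage.
    """
    return math.prod(stride for _, stride in ratios)


# Documented default ratios: strides are 1, 1, 2, 1.
default_ratios = [(4, 1), (4, 1), (4, 2), (4, 1)]
hop = total_hop_length(default_ratios)
print(hop)  # 2
```

For an exact value, read the `hop_length` attribute of a constructed encoder rather than recomputing it.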
Parameters:
- channels (int) – Audio channels. Defaults to 1.
- dimension (int) – Intermediate representation dimension. Defaults to 128.
- n_filters (int) – Base width for the model. Defaults to 32.
- n_residual_layers (int) – Number of residual layers. Defaults to 1.
- ratios (List[Tuple[int, int]]) – Kernel size and stride ratios. Defaults to [(4, 1), (4, 1), (4, 2), (4, 1)].
- activation (str) – Activation function. Defaults to "ELU".
- activation_params (dict) – Parameters for the activation function. Defaults to {"alpha": 1.0}.
- norm (str) – Normalization method. Defaults to "weight_norm".
- norm_params (Dict[str, Any]) – Parameters for the underlying normalization. Defaults to {}.
- kernel_size (int) – Kernel size for the initial convolution. Defaults to 7.
- last_kernel_size (int) – Kernel size for the last convolution. Defaults to 7.
- residual_kernel_size (int) – Kernel size for the residual layers. Defaults to 3.
- dilation_base (int) – How much to increase the dilation with each layer. Defaults to 2.
- causal (bool) – Whether to use fully causal convolution. Defaults to False.
- pad_mode (str) – Padding mode for convolutions. Defaults to "reflect".
- true_skip (bool) – Whether to use true skip connection or a simple convolution. Defaults to False.
- compress (int) – Reduced dimensionality in residual branches. Defaults to 2.
- lstm (int) – Number of LSTM layers at the end of the encoder. Defaults to 2.
- res_seq (bool) – Flag to indicate whether to apply sequential processing. Defaults to True.
- conv_group_ratio (int) – Group ratio for convolutions. Defaults to -1.

Examples:

>>> import torch
>>> from espnet2.gan_codec.shared.encoder.seanet_2d import SEANetEncoder2d
>>> encoder = SEANetEncoder2d(channels=2, dimension=256)
>>> audio_input = torch.randn(1, 2, 16000)  # Batch size of 1, 2 channels, 16000 samples
>>> output = encoder(audio_input)
>>> output.shape
torch.Size([1, 256, T])  # T depends on the input length and the internal configuration

Raises: AssertionError – If the number of kernel sizes does not match the number of dilations.
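The condition behind this error can be sketched in plain Python. The helper `check_block_config` below is hypothetical and only mirrors the documented rule that each kernel size needs a matching dilation. It is not the ESPnet implementation, and the example values are illustrative.

```python
from typing import Sequence


def check_block_config(kernel_sizes: Sequence, dilations: Sequence) -> None:
    """Raise AssertionError when the two sequences differ in length."""
    assert len(kernel_sizes) == len(dilations), (
        "Number of kernel sizes should match number of dilations"
    )


# Matching lengths pass silently.
check_block_config([(3, 3), (1, 1)], [(1, 1), (1, 1)])

# A length mismatch triggers the AssertionError.
try:
    check_block_config([(3, 3), (1, 1)], [(1, 1)])
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True

print(mismatch_detected)  # True
```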