espnet2.gan_codec.shared.encoder.seanet.SEANetEncoder
class espnet2.gan_codec.shared.encoder.seanet.SEANetEncoder(channels: int = 1, dimension: int = 128, n_filters: int = 32, n_residual_layers: int = 1, ratios: List[int] = [8, 5, 4, 2], activation: str = 'ELU', activation_params: dict = {'alpha': 1.0}, norm: str = 'weight_norm', norm_params: Dict[str, Any] = {}, kernel_size: int = 7, last_kernel_size: int = 7, residual_kernel_size: int = 3, dilation_base: int = 2, causal: bool = False, pad_mode: str = 'reflect', true_skip: bool = False, compress: int = 2, lstm: int = 2)
Bases: Module
SEANet encoder.
This class implements the SEANet encoder, which is a neural network architecture designed for audio processing tasks. The encoder utilizes convolutional layers, residual blocks, and optional LSTM layers to extract features from audio input.
channels
Number of audio channels (default is 1).
- Type: int
dimension
Dimension of the intermediate representation (default is 128).
- Type: int
n_filters
Base width for the model (default is 32).
- Type: int
n_residual_layers
Number of residual layers (default is 1).
- Type: int
ratios
Downsampling ratios (default is [8, 5, 4, 2]).
- Type: List[int]
activation
Activation function (default is “ELU”).
- Type: str
activation_params
Parameters for the activation function (default is {“alpha”: 1.0}).
- Type: dict
norm
Normalization method (default is “weight_norm”).
- Type: str
norm_params
Parameters for the underlying normalization used with the convolution (default is an empty dictionary).
- Type: dict
kernel_size
Kernel size for the initial convolution (default is 7).
- Type: int
last_kernel_size
Kernel size for the last convolution (default is 7).
- Type: int
residual_kernel_size
Kernel size for the residual layers (default is 3).
- Type: int
dilation_base
Base value for increasing dilation with each layer (default is 2).
- Type: int
causal
Whether to use fully causal convolution (default is False).
- Type: bool
pad_mode
Padding mode for convolutions (default is “reflect”).
- Type: str
true_skip
Whether to use true skip connections or a simple convolution as the skip connection in the residual blocks (default is False).
- Type: bool
compress
Reduced dimensionality in residual branches (default is 2).
- Type: int
lstm
Number of LSTM layers at the end of the encoder (default is 2).
- Type: int
Parameters:
- channels (int) – Audio channels.
- dimension (int) – Intermediate representation dimension.
- n_filters (int) – Base width for the model.
- n_residual_layers (int) – Number of residual layers.
- ratios (Sequence[int]) – Kernel size and stride ratios. The encoder downsamples rather than upsamples, so it applies these ratios in reverse order relative to the decoder; the list given here must match the order used by the decoder.
- activation (str) – Activation function.
- activation_params (dict) – Parameters to provide to the activation function.
- norm (str) – Normalization method.
- norm_params (dict) – Parameters to provide to the underlying normalization used along with the convolution.
- kernel_size (int) – Kernel size for the initial convolution.
- last_kernel_size (int) – Kernel size for the last convolution.
- residual_kernel_size (int) – Kernel size for the residual layers.
- dilation_base (int) – How much to increase the dilation with each layer.
- causal (bool) – Whether to use fully causal convolution.
- pad_mode (str) – Padding mode for the convolutions.
- true_skip (bool) – Whether to use true skip connection or a simple (streamable) convolution as the skip connection in the residual network blocks.
- compress (int) – Reduced dimensionality in residual branches (from Demucs v3).
- lstm (int) – Number of LSTM layers at the end of the encoder.
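The n_filters parameter sets only the base width; in the usual SEANet/Encodec convention, the channel count doubles after each downsampling stage. A small illustrative sketch of that progression (stage_channels is a hypothetical helper, not an ESPnet API):

```python
def stage_channels(n_filters: int = 32, n_stages: int = 4) -> list:
    """Channel widths per stage, assuming width doubles at each
    downsampling stage (SEANet/Encodec convention)."""
    mult = 1
    widths = [n_filters]  # width of the initial convolution
    for _ in range(n_stages):
        mult *= 2
        widths.append(n_filters * mult)
    return widths

print(stage_channels())  # [32, 64, 128, 256, 512]
```

With the defaults (n_filters=32 and four ratios), the deepest stage reaches 512 channels before being projected down to `dimension`.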
####### Examples
>>> encoder = SEANetEncoder(channels=1, dimension=128)
>>> input_tensor = torch.randn(1, 1, 16000) # Batch size of 1, 1 channel, 16000 samples
>>> output = encoder(input_tensor)
>>> print(output.shape)
torch.Size([1, 128, 50]) # 16000 samples / (8 * 5 * 4 * 2) = 50 frames
NOTE
This encoder is part of a larger audio processing framework and is intended for use in GAN-based audio synthesis tasks.
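The example output shape follows from the fact that the encoder's total hop size is the product of its downsampling ratios. A minimal sketch of that relationship (encoded_length is a hypothetical helper, not part of ESPnet):

```python
from math import prod

def encoded_length(num_samples: int, ratios=(8, 5, 4, 2)) -> int:
    """Predict the number of output frames for a given input length."""
    # The total hop size is the product of the downsampling ratios.
    hop = prod(ratios)  # 8 * 5 * 4 * 2 = 320
    return num_samples // hop

print(encoded_length(16000))  # 50 frames for 16000 input samples
```

This matches the doctest above: 16000 input samples yield a (1, 128, 50) representation with the default ratios.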
forward(x)
Encode input audio into the intermediate representation.

Parameters:
- x (torch.Tensor) – Input audio tensor of shape (batch, channels, length).

Returns: Encoded representation of shape (batch, dimension, length // prod(ratios)).

Return type: torch.Tensor
####### Examples
>>> encoder = SEANetEncoder(channels=1, dimension=128)
>>> audio_input = torch.randn(1, 1, 16000) # (batch_size, channels, length)
>>> output = encoder(audio_input)
>>> print(output.shape)
torch.Size([1, 128, 50]) # 16000 // (8 * 5 * 4 * 2) = 50 frames
NOTE
The ratios attribute defines the downsampling factors used in the encoder, which should be specified in reverse order compared to the decoder.
- Raises: AssertionError – If the number of kernel sizes does not match the number of dilations.
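As described for dilation_base, the dilation of residual layer j grows as dilation_base ** j. A hedged sketch of that schedule (residual_dilations is illustrative only, not an ESPnet API):

```python
def residual_dilations(n_residual_layers: int = 1, dilation_base: int = 2) -> list:
    """Dilation applied in each residual layer: dilation_base ** j
    for layer index j, widening the receptive field geometrically."""
    return [dilation_base ** j for j in range(n_residual_layers)]

print(residual_dilations(3, 2))  # [1, 2, 4]
```

With the default n_residual_layers=1, only a dilation of 1 is used; increasing it grows the receptive field without additional downsampling.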