espnet2.gan_codec.shared.encoder.seanet.SEANetEncoder
class espnet2.gan_codec.shared.encoder.seanet.SEANetEncoder(channels: int = 1, dimension: int = 128, n_filters: int = 32, n_residual_layers: int = 1, ratios: List[int] = [8, 5, 4, 2], activation: str = 'ELU', activation_params: dict = {'alpha': 1.0}, norm: str = 'weight_norm', norm_params: Dict[str, Any] = {}, kernel_size: int = 7, last_kernel_size: int = 7, residual_kernel_size: int = 3, dilation_base: int = 2, causal: bool = False, pad_mode: str = 'reflect', true_skip: bool = False, compress: int = 2, lstm: int = 2)
Bases: Module
SEANet encoder.
This class implements the SEANet encoder, which is a neural network architecture designed for audio processing tasks. The encoder utilizes convolutional layers, residual blocks, and optional LSTM layers to extract features from audio input.
channels
Number of audio channels (default is 1).
- Type: int
dimension
Dimension of the intermediate representation (default is 128).
- Type: int
n_filters
Base width for the model (default is 32).
- Type: int
n_residual_layers
Number of residual layers (default is 1).
- Type: int
ratios
Downsampling ratios (default is [8, 5, 4, 2]).
- Type: List[int]
activation
Activation function (default is “ELU”).
- Type: str
activation_params
Parameters for the activation function (default is {“alpha”: 1.0}).
- Type: dict
norm
Normalization method (default is “weight_norm”).
- Type: str
norm_params
Parameters for the underlying normalization used with the convolution (default is an empty dictionary).
- Type: dict
kernel_size
Kernel size for the initial convolution (default is 7).
- Type: int
last_kernel_size
Kernel size for the last convolution (default is 7).
- Type: int
residual_kernel_size
Kernel size for the residual layers (default is 3).
- Type: int
dilation_base
Base value for increasing dilation with each layer (default is 2).
- Type: int
causal
Whether to use fully causal convolution (default is False).
- Type: bool
pad_mode
Padding mode for convolutions (default is “reflect”).
- Type: str
true_skip
Whether to use true skip connections or a simple convolution as the skip connection in the residual blocks (default is False).
- Type: bool
compress
Reduced dimensionality in residual branches (default is 2).
- Type: int
lstm
Number of LSTM layers at the end of the encoder (default is 2).
- Type: int
Parameters:
- channels (int) – Audio channels.
- dimension (int) – Intermediate representation dimension.
- n_filters (int) – Base width for the model.
- n_residual_layers (int) – Number of residual layers.
- ratios (Sequence[int]) – Kernel size and stride ratios. The encoder downsamples rather than upsamples, so it applies these ratios in reverse order relative to the decoder; the list given here must match the order used by the decoder.
- activation (str) – Activation function.
- activation_params (dict) – Parameters to provide to the activation function.
- norm (str) – Normalization method.
- norm_params (dict) – Parameters to provide to the underlying normalization used along with the convolution.
- kernel_size (int) – Kernel size for the initial convolution.
- last_kernel_size (int) – Kernel size for the last convolution.
- residual_kernel_size (int) – Kernel size for the residual layers.
- dilation_base (int) – How much to increase the dilation with each layer.
- causal (bool) – Whether to use fully causal convolution.
- pad_mode (str) – Padding mode for the convolutions.
- true_skip (bool) – Whether to use true skip connection or a simple (streamable) convolution as the skip connection in the residual network blocks.
- compress (int) – Reduced dimensionality in residual branches (from Demucs v3).
- lstm (int) – Number of LSTM layers at the end of the encoder.
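The n_filters parameter sets only the base width; in the usual SEANet/Encodec convention, the channel count doubles after each downsampling stage. A small illustrative sketch of that progression (stage_channels is a hypothetical helper, not an ESPnet API):

```python
def stage_channels(n_filters: int = 32, n_stages: int = 4) -> list:
    """Channel widths per stage, assuming width doubles at each
    downsampling stage (SEANet/Encodec convention)."""
    mult = 1
    widths = [n_filters]  # width of the initial convolution
    for _ in range(n_stages):
        mult *= 2
        widths.append(n_filters * mult)
    return widths

print(stage_channels())  # [32, 64, 128, 256, 512]
```

With the defaults (n_filters=32 and four ratios), the deepest stage reaches 512 channels before being projected down to `dimension`.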
####### Examples
>>> encoder = SEANetEncoder(channels=1, dimension=128)
>>> input_tensor = torch.randn(1, 1, 16000) # Batch size of 1, 1 channel, 16000 samples
>>> output = encoder(input_tensor)
>>> print(output.shape)
torch.Size([1, 128, 50]) # 16000 samples / (8 * 5 * 4 * 2) = 50 frames
NOTE
This encoder is part of a larger audio processing framework and is intended for use in GAN-based audio synthesis tasks.
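The example output shape follows from the fact that the encoder's total hop size is the product of its downsampling ratios. A minimal sketch of that relationship (encoded_length is a hypothetical helper, not part of ESPnet):

```python
from math import prod

def encoded_length(num_samples: int, ratios=(8, 5, 4, 2)) -> int:
    """Predict the number of output frames for a given input length."""
    # The total hop size is the product of the downsampling ratios.
    hop = prod(ratios)  # 8 * 5 * 4 * 2 = 320
    return num_samples // hop

print(encoded_length(16000))  # 50 frames for 16000 input samples
```

This matches the doctest above: 16000 input samples yield a (1, 128, 50) representation with the default ratios.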
forward(x)
Encode input audio into the intermediate representation.

Parameters:
- x (torch.Tensor) – Input audio tensor of shape (batch, channels, length).

Returns: Encoded representation of shape (batch, dimension, length // prod(ratios)).

Return type: torch.Tensor
####### Examples
>>> encoder = SEANetEncoder(channels=1, dimension=128)
>>> audio_input = torch.randn(1, 1, 16000) # (batch_size, channels, length)
>>> output = encoder(audio_input)
>>> print(output.shape)
torch.Size([1, 128, 50]) # 16000 // (8 * 5 * 4 * 2) = 50 frames
NOTE
The ratios attribute defines the downsampling factors used in the encoder, which should be specified in reverse order compared to the decoder.
- Raises: AssertionError – If the number of kernel sizes does not match the number of dilations.
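As described for dilation_base, the dilation of residual layer j grows as dilation_base ** j. A hedged sketch of that schedule (residual_dilations is illustrative only, not an ESPnet API):

```python
def residual_dilations(n_residual_layers: int = 1, dilation_base: int = 2) -> list:
    """Dilation applied in each residual layer: dilation_base ** j
    for layer index j, widening the receptive field geometrically."""
    return [dilation_base ** j for j in range(n_residual_layers)]

print(residual_dilations(3, 2))  # [1, 2, 4]
```

With the default n_residual_layers=1, only a dilation of 1 is used; increasing it grows the receptive field without additional downsampling.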