espnet2.gan_codec.shared.encoder.seanet_2d.SEANetEncoder2d
class espnet2.gan_codec.shared.encoder.seanet_2d.SEANetEncoder2d(channels: int = 1, dimension: int = 128, n_filters: int = 32, n_residual_layers: int = 1, ratios: List[Tuple[int, int]] = [(4, 1), (4, 1), (4, 2), (4, 1)], activation: str = 'ELU', activation_params: dict = {'alpha': 1.0}, norm: str = 'weight_norm', norm_params: Dict[str, Any] = {}, kernel_size: int = 7, last_kernel_size: int = 7, residual_kernel_size: int = 3, dilation_base: int = 2, causal: bool = False, pad_mode: str = 'reflect', true_skip: bool = False, compress: int = 2, lstm: int = 2, res_seq=True, conv_group_ratio: int = -1)
Bases: Module
SEANet encoder for audio signal processing.
This class implements the SEANet encoder architecture, which is designed to process audio signals through a series of convolutional layers, residual blocks, and optional LSTM layers. The encoder reduces the dimensionality of the input while preserving important features for subsequent decoding.
- Parameters:
- channels (int) – Audio channels (default: 1).
- dimension (int) – Intermediate representation dimension (default: 128).
- n_filters (int) – Base width for the model (default: 32).
- n_residual_layers (int) – Number of residual layers (default: 1).
- ratios (List[Tuple[int, int]]) – Kernel size and stride ratios for downsampling (default: [(4, 1), (4, 1), (4, 2), (4, 1)]).
- activation (str) – Activation function (default: “ELU”).
- activation_params (dict) – Parameters for the activation function (default: {“alpha”: 1.0}).
- norm (str) – Normalization method (default: “weight_norm”).
- norm_params (Dict[str, Any]) – Parameters for the normalization method (default: {}).
- kernel_size (int) – Kernel size for the initial convolution (default: 7).
- last_kernel_size (int) – Kernel size for the last convolution (default: 7).
- residual_kernel_size (int) – Kernel size for the residual layers (default: 3).
- dilation_base (int) – Increase factor for dilation with each layer (default: 2).
- causal (bool) – Whether to use fully causal convolution (default: False).
- pad_mode (str) – Padding mode for convolutions (default: “reflect”).
- true_skip (bool) – Use true skip connection or simple convolution for skip connection (default: False).
- compress (int) – Reduced dimensionality in residual branches (default: 2).
- lstm (int) – Number of LSTM layers at the end of the encoder (default: 2).
- res_seq (bool) – Whether to use a residual sequence (default: True).
- conv_group_ratio (int) – Ratio for grouping convolutions (default: -1).
####### Examples
>>> import torch
>>> from espnet2.gan_codec.shared.encoder.seanet_2d import SEANetEncoder2d
>>> encoder = SEANetEncoder2d(channels=2, dimension=256)
>>> input_tensor = torch.randn(1, 2, 16000)  # (batch_size, channels, time)
>>> output = encoder(input_tensor)
>>> output.shape
torch.Size([1, 256, <time_dim>])  # <time_dim> is a placeholder; it depends on the input length
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x)
SEANet encoder.
This class implements the SEANet encoder architecture for audio processing, utilizing a series of convolutional layers, residual blocks, and optional LSTM layers to produce an intermediate representation of audio inputs.
channels
Number of audio channels.
- Type: int
dimension
Intermediate representation dimension.
- Type: int
n_filters
Base width for the model.
- Type: int
n_residual_layers
Number of residual layers.
- Type: int
ratios
Kernel size and stride ratios for downsampling.
- Type: List[Tuple[int, int]]
hop_length
The total hop length calculated from the ratios.
- Type: int
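As a rough illustration of how a total hop length could follow from `ratios`, the sketch below assumes that each tuple is `(kernel_size, stride)`, as the parameter description suggests, and that the hop length is the product of the strides. Both points are assumptions rather than confirmed details of the ESPnet implementation, and `total_hop_length` is a hypothetical helper.

```python
import math
from typing import List, Tuple


def total_hop_length(ratios: List[Tuple[int, int]]) -> int:
    """Multiply the second element (assumed to be the stride) of every tuple."""
    return math.prod(stride for _kernel, stride in ratios)


# Default value of `ratios` from the class signature.
default_ratios = [(4, 1), (4, 1), (4, 2), (4, 1)]
hop = total_hop_length(default_ratios)
print(hop)  # 2 under the stated assumption (1 * 1 * 2 * 1)
```

If the real encoder treats the tuple elements differently, only the unpacking inside the helper would need to change.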
Parameters:
- channels (int) – Audio channels. Defaults to 1.
- dimension (int) – Intermediate representation dimension. Defaults to 128.
- n_filters (int) – Base width for the model. Defaults to 32.
- n_residual_layers (int) – Number of residual layers. Defaults to 1.
- ratios (List[Tuple[int, int]]) – Kernel size and stride ratios. Defaults to [(4, 1), (4, 1), (4, 2), (4, 1)].
- activation (str) – Activation function. Defaults to “ELU”.
- activation_params (dict) – Parameters for the activation function. Defaults to {“alpha”: 1.0}.
- norm (str) – Normalization method. Defaults to “weight_norm”.
- norm_params (Dict[str, Any]) – Parameters for the underlying normalization.
- kernel_size (int) – Kernel size for the initial convolution. Defaults to 7.
- last_kernel_size (int) – Kernel size for the last convolution. Defaults to 7.
- residual_kernel_size (int) – Kernel size for the residual layers. Defaults to 3.
- dilation_base (int) – How much to increase the dilation with each layer. Defaults to 2.
- causal (bool) – Whether to use fully causal convolution. Defaults to False.
- pad_mode (str) – Padding mode for convolutions. Defaults to “reflect”.
- true_skip (bool) – Whether to use true skip connection or a simple convolution. Defaults to False.
- compress (int) – Reduced dimensionality in residual branches. Defaults to 2.
- lstm (int) – Number of LSTM layers at the end of the encoder. Defaults to 2.
- res_seq (bool) – Flag to indicate whether to apply sequential processing. Defaults to True.
- conv_group_ratio (int) – Group ratio for convolutions. Defaults to -1.
####### Examples
>>> import torch
>>> from espnet2.gan_codec.shared.encoder.seanet_2d import SEANetEncoder2d
>>> encoder = SEANetEncoder2d(channels=2, dimension=256)
>>> audio_input = torch.randn(1, 2, 16000)  # batch size 1, 2 channels, 16000 samples
>>> output = encoder(audio_input)
>>> output.shape
torch.Size([1, 256, T])  # T is a placeholder; it depends on the input length and configuration
- Raises: AssertionError – If the number of kernel sizes does not match the number of dilations.
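
The `dilation_base`, `compress`, and `residual_kernel_size` parameters, together with the `AssertionError` above, can be pictured with a small sketch. It follows a pattern commonly seen in SEANet-style residual blocks: dilation growing as `dilation_base ** j`, a hidden width of `dim // compress`, and a check that kernel sizes and dilations pair up. Whether `SEANetEncoder2d` does exactly this is an assumption, and `residual_block_plan` is a hypothetical helper, not part of ESPnet.

```python
from typing import Dict, List


def residual_block_plan(
    dim: int,
    n_residual_layers: int = 1,
    residual_kernel_size: int = 3,
    dilation_base: int = 2,
    compress: int = 2,
) -> List[Dict[str, object]]:
    """Describe each residual layer without building any torch modules."""
    plan = []
    for j in range(n_residual_layers):
        # Assumed layout: one dilated convolution followed by a 1x1 convolution.
        kernel_sizes = [residual_kernel_size, 1]
        dilations = [dilation_base ** j, 1]
        # Mirrors the documented failure mode: one dilation per kernel size.
        assert len(kernel_sizes) == len(dilations), (
            "Number of kernel sizes should match number of dilations"
        )
        plan.append(
            {
                "hidden": dim // compress,  # reduced width inside the branch
                "kernel_sizes": kernel_sizes,
                "dilations": dilations,
            }
        )
    return plan


for layer in residual_block_plan(dim=32, n_residual_layers=3):
    print(layer)
```

With `dilation_base=2` and three layers, the first dilation in each layer of this sketch is 1, 2, and 4, which shows why the parameter is described as an increase factor per layer.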