espnet2.enh.encoder.conv_encoder.ConvEncoder
class espnet2.enh.encoder.conv_encoder.ConvEncoder(channel: int, kernel_size: int, stride: int)
Bases: AbsEncoder
Convolutional encoder for speech enhancement and separation.
This class implements a convolutional encoder built on a 1D convolutional layer followed by a ReLU activation. It is designed to encode mixed speech waveforms for tasks such as speech enhancement and separation.
output_dim
The dimension of the output features after encoding.
Type: int
Parameters:
- channel (int) – The number of output channels in the convolutional layer.
- kernel_size (int) – The size of the convolutional kernel.
- stride (int) – The stride of the convolution.
Returns:
- feature (torch.Tensor) – Mixed feature after the encoder, with shape [Batch, flens, channel].
- flens (torch.Tensor) – Output lengths after encoding, with shape [Batch].
Raises: AssertionError – If the input tensor does not have the correct dimensions.
############# Examples
>>> import torch
>>> input_audio = torch.randn((2, 100))
>>> ilens = torch.LongTensor([100, 98])
>>> encoder = ConvEncoder(kernel_size=32, stride=10, channel=16)
>>> frames, flens = encoder(input_audio, ilens)
>>> splited = encoder.streaming_frame(input_audio)
>>> sframes = [encoder.forward_streaming(s) for s in splited]
>>> sframes = torch.cat(sframes, dim=1)
>>> torch.testing.assert_allclose(sframes, frames)
######## NOTE The fs argument in the forward method is not used and can be set to None.
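For orientation, the following is a minimal sketch of what such an encoder boils down to: a single torch.nn.Conv1d over the raw waveform followed by a ReLU, with the result transposed to the documented [Batch, flens, channel] layout. The class name MiniConvEncoder and details such as bias handling are illustrative assumptions, not the actual ESPnet implementation.
>>> import torch
>>> class MiniConvEncoder(torch.nn.Module):
...     def __init__(self, channel, kernel_size, stride):
...         super().__init__()
...         # Single-channel waveform in, `channel` feature maps out.
...         self.conv1d = torch.nn.Conv1d(1, channel, kernel_size, stride=stride)
...     def forward(self, mixture):
...         # [Batch, sample] -> [Batch, channel, flens] -> [Batch, flens, channel]
...         return torch.relu(self.conv1d(mixture.unsqueeze(1))).transpose(1, 2)
>>> encoder = MiniConvEncoder(channel=16, kernel_size=32, stride=10)
>>> encoder(torch.randn(2, 100)).shape
torch.Size([2, 7, 16])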
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(input: Tensor, ilens: Tensor, fs: int | None = None)
Forward pass of the convolutional encoder.
This method processes the input mixed speech signal through a 1D convolutional layer, applying ReLU activation and returning the encoded features along with the calculated lengths of the output features.
Parameters:
- input (torch.Tensor) – Mixed speech input of shape [Batch, sample].
- ilens (torch.Tensor) – Lengths of the input sequences of shape [Batch].
- fs (int, optional) – Sampling rate in Hz (not used in the current implementation).
Returns:
- feature (torch.Tensor) – Encoded mixed feature of shape [Batch, flens, channel], where flens is the output length after encoding.
- flens (torch.Tensor) – Lengths of the output features, of shape [Batch].
Raises: AssertionError – If the input tensor does not have 2 dimensions.
############# Examples
>>> input_audio = torch.randn((2, 100))
>>> ilens = torch.LongTensor([100, 98])
>>> encoder = ConvEncoder(kernel_size=32, stride=10, channel=16)
>>> feature, flens = encoder(input_audio, ilens)
>>> print(feature.shape) # Output shape: [Batch, flens, channel]
>>> print(flens) # Output lengths for each batch
######## NOTE The input tensor is expected to be a single-channel tensor.
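As a quick sanity check on flens, the standard Conv1d output-length formula (assuming no padding and unit dilation, the torch.nn.Conv1d defaults) predicts the number of output frames; it matches the 7-frame output shown in the forward_streaming example below.
>>> ilens, kernel_size, stride = 100, 32, 10
>>> (ilens - kernel_size) // stride + 1  # floor((L_in - K) / S) + 1
7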
forward_streaming(input: Tensor)
Perform the forward pass for streaming input.
This method is designed to handle streaming audio inputs by utilizing the forward method. It takes a tensor representing audio data and processes it through the convolutional encoder to produce the output features.
- Parameters: input (torch.Tensor) – Input tensor representing mixed speech, with shape [Batch, sample].
- Returns: Output tensor containing the mixed features after encoding, with shape [Batch, flens, channel].
- Return type: torch.Tensor
############# Examples
>>> encoder = ConvEncoder(kernel_size=32, stride=10, channel=16)
>>> input_audio = torch.randn((2, 100))
>>> output = encoder.forward_streaming(input_audio)
>>> print(output.shape)
torch.Size([2, 7, 16]) # Example output shape based on kernel and stride
######## NOTE Unlike forward, this method does not take an ilens argument; it simply reuses the forward computation on the given chunk.
property output_dim : int
Get the output dimension of the ConvEncoder.
This property returns the number of output channels produced by the convolutional layer in the encoder.
- Returns: The output dimension, which corresponds to the number of channels specified during initialization.
- Return type: int
############# Examples
>>> encoder = ConvEncoder(kernel_size=3, stride=1, channel=16)
>>> print(encoder.output_dim)
16
######## NOTE This property is primarily used to obtain the output dimension after the convolutional processing of the input.
streaming_frame(audio: Tensor)
Stream frame.
It splits continuous audio into frame-level chunks for a streaming simulation. The function takes the entire long audio as input, so it is intended for simulation only; you may refer to it as a guide for managing the input buffer in a real streaming application.
- Parameters: audio (torch.Tensor) – Input audio tensor of shape (B, T), where B is the batch size and T is the total length of the audio.
- Returns: A list of chunked audio tensors, each of shape (B, frame_size), where frame_size is determined by the kernel size.
- Return type: List[torch.Tensor]
############# Examples
>>> encoder = ConvEncoder(kernel_size=32, stride=10, channel=16)
>>> audio_input = torch.randn((2, 100))
>>> frames = encoder.streaming_frame(audio_input)
>>> for frame in frames:
... print(frame.shape)
torch.Size([2, 32])
torch.Size([2, 32])
...
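The chunking shown above can be reproduced with plain tensor slicing. The helper below is a hypothetical sketch (split_into_frames is not part of ESPnet) that assumes a sliding window of length kernel_size with hop stride and no padding of trailing samples; the actual streaming_frame implementation may handle edges differently.
>>> def split_into_frames(audio, kernel_size=32, stride=10):
...     # Number of full windows that fit into the signal.
...     n_frames = (audio.shape[1] - kernel_size) // stride + 1
...     return [audio[:, i * stride : i * stride + kernel_size] for i in range(n_frames)]
>>> chunks = split_into_frames(torch.randn(2, 100))
>>> len(chunks), chunks[0].shape
(7, torch.Size([2, 32]))
Feeding each chunk through forward_streaming and concatenating the results along dim=1 reproduces the batch forward output, as demonstrated in the class-level example.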