espnet2.enh.encoder.conv_encoder.ConvEncoder
class espnet2.enh.encoder.conv_encoder.ConvEncoder(channel: int, kernel_size: int, stride: int)
Bases: AbsEncoder
Convolutional encoder for speech enhancement and separation.
This class implements a convolutional encoder built on a 1D convolutional layer followed by a ReLU activation. It is designed to encode mixed speech waveforms for tasks such as speech enhancement and separation.
output_dim
The dimension of the output features after encoding.
Type: int
Parameters:
- channel (int) – The number of output channels in the convolutional layer.
- kernel_size (int) – The size of the convolutional kernel.
- stride (int) – The stride of the convolution.
Returns:
- feature (torch.Tensor) – Mixed feature after the encoder, with shape [Batch, flens, channel].
- flens (torch.Tensor) – Output lengths after encoding, with shape [Batch].
Raises: AssertionError – If the input tensor does not have the correct dimensions.
############# Examples
>>> import torch
>>> input_audio = torch.randn((2, 100))
>>> ilens = torch.LongTensor([100, 98])
>>> encoder = ConvEncoder(kernel_size=32, stride=10, channel=16)
>>> frames, flens = encoder(input_audio, ilens)
>>> splited = encoder.streaming_frame(input_audio)
>>> sframes = [encoder.forward_streaming(s) for s in splited]
>>> sframes = torch.cat(sframes, dim=1)
>>> torch.testing.assert_allclose(sframes, frames)
######## NOTE The fs argument in the forward method is not used and can be set to None.
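For orientation, the following is a minimal sketch of what such an encoder boils down to: a single torch.nn.Conv1d over the raw waveform followed by a ReLU, with the result transposed to the documented [Batch, flens, channel] layout. The class name MiniConvEncoder and details such as bias handling are illustrative assumptions, not the actual ESPnet implementation.
>>> import torch
>>> class MiniConvEncoder(torch.nn.Module):
...     def __init__(self, channel, kernel_size, stride):
...         super().__init__()
...         # Single-channel waveform in, `channel` feature maps out.
...         self.conv1d = torch.nn.Conv1d(1, channel, kernel_size, stride=stride)
...     def forward(self, mixture):
...         # [Batch, sample] -> [Batch, channel, flens] -> [Batch, flens, channel]
...         return torch.relu(self.conv1d(mixture.unsqueeze(1))).transpose(1, 2)
>>> encoder = MiniConvEncoder(channel=16, kernel_size=32, stride=10)
>>> encoder(torch.randn(2, 100)).shape
torch.Size([2, 7, 16])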
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(input: Tensor, ilens: Tensor, fs: int | None = None)
Forward pass of the convolutional encoder.
This method processes the input mixed speech signal through a 1D convolutional layer, applying ReLU activation and returning the encoded features along with the calculated lengths of the output features.
Parameters:
- input (torch.Tensor) – Mixed speech input of shape [Batch, sample].
- ilens (torch.Tensor) – Lengths of the input sequences of shape [Batch].
- fs (int, optional) – Sampling rate in Hz (not used in the current implementation).
Returns:
- feature (torch.Tensor) – Encoded mixed feature of shape [Batch, flens, channel], where flens is the output length after encoding.
- flens (torch.Tensor) – Lengths of the output features, of shape [Batch].
Raises: AssertionError – If the input tensor does not have 2 dimensions.
############# Examples
>>> input_audio = torch.randn((2, 100))
>>> ilens = torch.LongTensor([100, 98])
>>> encoder = ConvEncoder(kernel_size=32, stride=10, channel=16)
>>> feature, flens = encoder(input_audio, ilens)
>>> print(feature.shape) # Output shape: [Batch, flens, channel]
>>> print(flens) # Output lengths for each batch
######## NOTE The input tensor is expected to be a single-channel tensor.
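As a quick sanity check on flens, the standard Conv1d output-length formula (assuming no padding and unit dilation, the torch.nn.Conv1d defaults) predicts the number of output frames; it matches the 7-frame output shown in the forward_streaming example below.
>>> ilens, kernel_size, stride = 100, 32, 10
>>> (ilens - kernel_size) // stride + 1  # floor((L_in - K) / S) + 1
7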
forward_streaming(input: Tensor)
Perform the forward pass for streaming input.
This method is designed to handle streaming audio inputs by utilizing the forward method. It takes a tensor representing audio data and processes it through the convolutional encoder to produce the output features.
- Parameters: input (torch.Tensor) – Input tensor representing mixed speech, with shape [Batch, sample].
- Returns: Output tensor containing the mixed features after encoding, with shape [Batch, flens, channel].
- Return type: torch.Tensor
############# Examples
>>> encoder = ConvEncoder(kernel_size=32, stride=10, channel=16)
>>> input_audio = torch.randn((2, 100))
>>> output = encoder.forward_streaming(input_audio)
>>> print(output.shape)
torch.Size([2, 7, 16]) # Example output shape based on kernel and stride
######## NOTE Unlike forward, this method does not take an ilens argument; it simply reuses the forward computation on the given chunk.
property output_dim : int
Get the output dimension of the ConvEncoder.
This property returns the number of output channels produced by the convolutional layer in the encoder.
- Returns: The output dimension, which corresponds to the number of channels specified during initialization.
- Return type: int
############# Examples
>>> encoder = ConvEncoder(kernel_size=3, stride=1, channel=16)
>>> print(encoder.output_dim)
16
######## NOTE This property is primarily used to obtain the output dimension after the convolutional processing of the input.
streaming_frame(audio: Tensor)
Stream frame.
It splits continuous audio into frame-level chunks for a streaming simulation. The function takes the entire long audio as input, so it is intended for simulation only; you may refer to it as a guide for managing the input buffer in a real streaming application.
- Parameters: audio (torch.Tensor) – Input audio tensor of shape (B, T), where B is the batch size and T is the total length of the audio.
- Returns: A list of chunked audio tensors, each of shape (B, frame_size), where frame_size is determined by the kernel size.
- Return type: List[torch.Tensor]
############# Examples
>>> encoder = ConvEncoder(kernel_size=32, stride=10, channel=16)
>>> audio_input = torch.randn((2, 100))
>>> frames = encoder.streaming_frame(audio_input)
>>> for frame in frames:
... print(frame.shape)
torch.Size([2, 32])
torch.Size([2, 32])
...
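The chunking shown above can be reproduced with plain tensor slicing. The helper below is a hypothetical sketch (split_into_frames is not part of ESPnet) that assumes a sliding window of length kernel_size with hop stride and no padding of trailing samples; the actual streaming_frame implementation may handle edges differently.
>>> def split_into_frames(audio, kernel_size=32, stride=10):
...     # Number of full windows that fit into the signal.
...     n_frames = (audio.shape[1] - kernel_size) // stride + 1
...     return [audio[:, i * stride : i * stride + kernel_size] for i in range(n_frames)]
>>> chunks = split_into_frames(torch.randn(2, 100))
>>> len(chunks), chunks[0].shape
(7, torch.Size([2, 32]))
Feeding each chunk through forward_streaming and concatenating the results along dim=1 reproduces the batch forward output, as demonstrated in the class-level example.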