espnet2.enh.decoder.conv_decoder.ConvDecoder
class espnet2.enh.decoder.conv_decoder.ConvDecoder(channel: int, kernel_size: int, stride: int)
Bases: AbsDecoder
ConvDecoder is a transposed convolutional decoder for speech enhancement and separation.
This class extends AbsDecoder and decodes the output of a convolutional encoder back into a time-domain waveform. The decoding is performed by a single transposed convolutional layer (torch.nn.ConvTranspose1d), the counterpart of the strided convolution applied by the encoder.
convtrans1d
The transposed convolutional layer used for decoding.
- Type: torch.nn.ConvTranspose1d
kernel_size
The size of the kernel used in the transposed convolution.
- Type: int
stride
The stride of the transposed convolution.
- Type: int
Parameters:
- channel (int) – The number of input channels for the transposed convolution.
- kernel_size (int) – The size of the convolutional kernel.
- stride (int) – The stride for the transposed convolution.
forward(input: torch.Tensor, ilens: torch.Tensor, fs: int = None) -> Tuple[torch.Tensor, torch.Tensor]:
Performs the forward pass, decoding the input feature tensor into a waveform.
forward_streaming(input_frame: torch.Tensor) -> torch.Tensor:
Performs the streaming forward pass for a single input frame.
streaming_merge(chunks: torch.Tensor, ilens: torch.Tensor = None) -> torch.Tensor:
Merges frame-level processed audio chunks in a streaming simulation.
- Raises: ValueError – If input tensor dimensions do not match the expected shapes.
######### Examples
>>> import torch
>>> input_feature = torch.randn((1, 100, 16))  # [Batch, T, F=channel]
>>> ilens = torch.LongTensor([100])
>>> kernel_size = 32
>>> stride = 16
>>> decoder = ConvDecoder(channel=16, kernel_size=kernel_size, stride=stride)
>>> wav, wav_lens = decoder(input_feature, ilens)
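The decoder is typically paired with a convolutional encoder. A minimal round-trip sketch follows (it assumes the companion espnet2.enh.encoder.conv_encoder.ConvEncoder with matching channel, kernel_size, and stride; an illustration rather than a prescribed pipeline):
>>> from espnet2.enh.encoder.conv_encoder import ConvEncoder
>>> encoder = ConvEncoder(channel=16, kernel_size=32, stride=16)
>>> decoder = ConvDecoder(channel=16, kernel_size=32, stride=16)
>>> wav_in = torch.randn(2, 1600)  # [Batch, samples]
>>> ilens = torch.LongTensor([1600, 1600])
>>> feats, flens = encoder(wav_in, ilens)  # [Batch, T, F=channel]
>>> wav_out, olens = decoder(feats, flens)  # [Batch, samples']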
NOTE
The fs parameter in the forward method is currently not utilized.
#
forward(input: torch.Tensor, ilens: torch.Tensor, fs: int = None) -> Tuple[torch.Tensor, torch.Tensor]
Forward.
- Parameters:
- input (torch.Tensor) – spectrum [Batch, T, F]
- ilens (torch.Tensor) – input lengths [Batch]
- fs (int) – sampling rate in Hz (not used)
- Returns: The decoded waveform [Batch, N] and the corresponding lengths [Batch].
- Return type: Tuple[torch.Tensor, torch.Tensor]
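######### Examples
A minimal shape check (a sketch; assuming an unpadded transposed convolution, the output length is kernel_size + stride * (T - 1), here 32 + 16 * 9 = 176):
>>> import torch
>>> decoder = ConvDecoder(channel=16, kernel_size=32, stride=16)
>>> spectrum = torch.randn(1, 10, 16)  # [Batch, T, F=channel]
>>> ilens = torch.LongTensor([10])
>>> wav, olens = decoder(spectrum, ilens)
>>> wav.shape
torch.Size([1, 176])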
#
forward_streaming(input_frame: torch.Tensor) -> torch.Tensor
Forward streaming of audio frames through the ConvDecoder.
This method processes a single input frame and returns the output waveform corresponding to that frame. It is primarily used for streaming applications where audio is processed in small chunks.
- Parameters: input_frame (torch.Tensor) – A tensor holding a single feature frame to decode. The shape should be [B, 1, F], where B is the batch size and F is the number of feature channels (equal to channel).
- Returns: The output waveform for that frame, of shape [B, T]; for a single input frame, T equals kernel_size.
- Return type: torch.Tensor
######### Examples
>>> decoder = ConvDecoder(channel=16, kernel_size=32, stride=16)
>>> input_frame = torch.randn(1, 1, 16)  # a single [B, 1, F=channel] feature frame
>>> output_waveform = decoder.forward_streaming(input_frame)
>>> print(output_waveform.shape)  # [B, kernel_size]
torch.Size([1, 32])
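In a streaming simulation, the per-frame outputs of forward_streaming can be reassembled with streaming_merge; a sketch, under the same shape assumptions as the example above:
>>> decoder = ConvDecoder(channel=16, kernel_size=32, stride=16)
>>> frames = [torch.randn(1, 1, 16) for _ in range(5)]  # five feature frames
>>> chunks = [decoder.forward_streaming(f) for f in frames]  # each [1, 32]
>>> waveform = decoder.streaming_merge(chunks)
>>> waveform.shape
torch.Size([1, 96])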
#
streaming_merge(chunks: torch.Tensor, ilens: torch.Tensor = None) -> torch.Tensor
Stream Merge.
It merges the frame-level processed audio chunks in the streaming simulation. Note that in real applications the processed audio should be sent to the output channel frame by frame; you may refer to this function when managing your own streaming output buffer.
- Parameters:
- chunks (List[torch.Tensor]) – A list of tensors, each of shape (B, frame_size), representing processed audio chunks.
- ilens (torch.Tensor, optional) – A tensor of shape [B] containing the lengths of each batch. If not provided, the maximum length will be calculated from the number of chunks.
- Returns: A tensor of shape [B, T] representing the merged audio output, where T is the total length of the merged audio.
- Return type: torch.Tensor
######### Examples
>>> decoder = ConvDecoder(channel=16, kernel_size=32, stride=16)
>>> chunks = [torch.randn(1, 32) for _ in range(5)]
>>> merged_audio = decoder.streaming_merge(chunks)
>>> print(merged_audio.shape)
torch.Size([1, 96])  # stride * (num_chunks - 1) + kernel_size = 16 * 4 + 32
NOTE
The chunks should be provided in the order in which they were processed; merging assumes that consecutive frames overlap by kernel_size - stride samples, as determined by the stride.
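As an illustration of the overlap-add behavior this note describes (a sketch, not necessarily the exact internal implementation; it assumes consecutive chunks overlap by kernel_size - stride samples and that overlapping regions are summed):
>>> import torch
>>> frame_size, hop_size = 32, 16  # kernel_size, stride
>>> chunks = [torch.randn(1, frame_size) for _ in range(5)]
>>> total_len = hop_size * (len(chunks) - 1) + frame_size
>>> merged = torch.zeros(1, total_len)
>>> for i, chunk in enumerate(chunks):
...     merged[:, i * hop_size : i * hop_size + frame_size] += chunk
>>> merged.shape
torch.Size([1, 96])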