espnet2.enh.decoder.conv_decoder.ConvDecoder
class espnet2.enh.decoder.conv_decoder.ConvDecoder(channel: int, kernel_size: int, stride: int)
Bases: AbsDecoder
ConvDecoder is a transposed convolutional decoder for speech enhancement and separation.
This class extends AbsDecoder and decodes the output of a convolutional encoder back into a time-domain waveform. The decoding is performed by a single transposed convolutional layer (torch.nn.ConvTranspose1d), the counterpart of the strided convolution applied by the encoder.
convtrans1d
The transposed convolutional layer used for decoding.
- Type: torch.nn.ConvTranspose1d
kernel_size
The size of the kernel used in the transposed convolution.
- Type: int
stride
The stride of the transposed convolution.
- Type: int
Parameters:
- channel (int) – The number of input channels for the transposed convolution.
- kernel_size (int) – The size of the convolutional kernel.
- stride (int) – The stride for the transposed convolution.
forward(input: torch.Tensor, ilens: torch.Tensor, fs: int = None) -> Tuple[torch.Tensor, torch.Tensor]:
Performs the forward pass, decoding the input feature tensor into a waveform.
forward_streaming(input_frame: torch.Tensor) -> torch.Tensor:
Performs the streaming forward pass for a single input frame.
streaming_merge(chunks: torch.Tensor, ilens: torch.Tensor = None) -> torch.Tensor:
Merges frame-level processed audio chunks in a streaming simulation.
- Raises: ValueError – If input tensor dimensions do not match the expected shapes.
######### Examples
>>> import torch
>>> input_feature = torch.randn((1, 100, 16))  # [Batch, T, F=channel]
>>> ilens = torch.LongTensor([100])
>>> kernel_size = 32
>>> stride = 16
>>> decoder = ConvDecoder(channel=16, kernel_size=kernel_size, stride=stride)
>>> wav, wav_lens = decoder(input_feature, ilens)
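The decoder is typically paired with a convolutional encoder. A minimal round-trip sketch follows (it assumes the companion espnet2.enh.encoder.conv_encoder.ConvEncoder with matching channel, kernel_size, and stride; an illustration rather than a prescribed pipeline):
>>> from espnet2.enh.encoder.conv_encoder import ConvEncoder
>>> encoder = ConvEncoder(channel=16, kernel_size=32, stride=16)
>>> decoder = ConvDecoder(channel=16, kernel_size=32, stride=16)
>>> wav_in = torch.randn(2, 1600)  # [Batch, samples]
>>> ilens = torch.LongTensor([1600, 1600])
>>> feats, flens = encoder(wav_in, ilens)  # [Batch, T, F=channel]
>>> wav_out, olens = decoder(feats, flens)  # [Batch, samples']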
NOTE
The fs parameter in the forward method is currently not utilized.
#
forward(input: torch.Tensor, ilens: torch.Tensor, fs: int = None) -> Tuple[torch.Tensor, torch.Tensor]
Forward.
- Parameters:
- input (torch.Tensor) – spectrum [Batch, T, F]
- ilens (torch.Tensor) – input lengths [Batch]
- fs (int) – sampling rate in Hz (not used)
- Returns: The decoded waveform [Batch, N] and the corresponding lengths [Batch].
- Return type: Tuple[torch.Tensor, torch.Tensor]
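######### Examples
A minimal shape check (a sketch; assuming an unpadded transposed convolution, the output length is kernel_size + stride * (T - 1), here 32 + 16 * 9 = 176):
>>> import torch
>>> decoder = ConvDecoder(channel=16, kernel_size=32, stride=16)
>>> spectrum = torch.randn(1, 10, 16)  # [Batch, T, F=channel]
>>> ilens = torch.LongTensor([10])
>>> wav, olens = decoder(spectrum, ilens)
>>> wav.shape
torch.Size([1, 176])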
#
forward_streaming(input_frame: torch.Tensor) -> torch.Tensor
Forward streaming of audio frames through the ConvDecoder.
This method processes a single input frame and returns the output waveform corresponding to that frame. It is primarily used for streaming applications where audio is processed in small chunks.
- Parameters: input_frame (torch.Tensor) – A tensor holding a single feature frame to decode. The shape should be [B, 1, F], where B is the batch size and F is the number of feature channels (equal to channel).
- Returns: The output waveform for that frame, of shape [B, T]; for a single input frame, T equals kernel_size.
- Return type: torch.Tensor
######### Examples
>>> decoder = ConvDecoder(channel=16, kernel_size=32, stride=16)
>>> input_frame = torch.randn(1, 1, 16)  # a single [B, 1, F=channel] feature frame
>>> output_waveform = decoder.forward_streaming(input_frame)
>>> print(output_waveform.shape)  # [B, kernel_size]
torch.Size([1, 32])
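In a streaming simulation, the per-frame outputs of forward_streaming can be reassembled with streaming_merge; a sketch, under the same shape assumptions as the example above:
>>> decoder = ConvDecoder(channel=16, kernel_size=32, stride=16)
>>> frames = [torch.randn(1, 1, 16) for _ in range(5)]  # five feature frames
>>> chunks = [decoder.forward_streaming(f) for f in frames]  # each [1, 32]
>>> waveform = decoder.streaming_merge(chunks)
>>> waveform.shape
torch.Size([1, 96])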
#
streaming_merge(chunks: torch.Tensor, ilens: torch.Tensor = None) -> torch.Tensor
Stream Merge.
It merges the frame-level processed audio chunks in the streaming simulation. Note that in real applications the processed audio should be sent to the output channel frame by frame; you may refer to this function when managing your own streaming output buffer.
- Parameters:
- chunks (List[torch.Tensor]) – A list of tensors, each of shape (B, frame_size), representing processed audio chunks.
- ilens (torch.Tensor, optional) – A tensor of shape [B] containing the lengths of each batch. If not provided, the maximum length will be calculated from the number of chunks.
- Returns: A tensor of shape [B, T] representing the merged audio output, where T is the total length of the merged audio.
- Return type: torch.Tensor
######### Examples
>>> decoder = ConvDecoder(channel=16, kernel_size=32, stride=16)
>>> chunks = [torch.randn(1, 32) for _ in range(5)]
>>> merged_audio = decoder.streaming_merge(chunks)
>>> print(merged_audio.shape)
torch.Size([1, 96])  # stride * (num_chunks - 1) + kernel_size = 16 * 4 + 32
NOTE
The chunks should be provided in the order in which they were processed; merging assumes that consecutive frames overlap by kernel_size - stride samples, as determined by the stride.
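As an illustration of the overlap-add behavior this note describes (a sketch, not necessarily the exact internal implementation; it assumes consecutive chunks overlap by kernel_size - stride samples and that overlapping regions are summed):
>>> import torch
>>> frame_size, hop_size = 32, 16  # kernel_size, stride
>>> chunks = [torch.randn(1, frame_size) for _ in range(5)]
>>> total_len = hop_size * (len(chunks) - 1) + frame_size
>>> merged = torch.zeros(1, total_len)
>>> for i, chunk in enumerate(chunks):
...     merged[:, i * hop_size : i * hop_size + frame_size] += chunk
>>> merged.shape
torch.Size([1, 96])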