espnet2.asr.frontend.windowing.SlidingWindow

About 2 min

espnet2.asr.frontend.windowing.SlidingWindow

class espnet2.asr.frontend.windowing.SlidingWindow(win_length: int = 400, hop_length: int = 160, channels: int = 1, padding: int | None = None, fs=None)

Bases: AbsFrontend

Sliding Window.

This class provides a sliding window mechanism over a batched continuous raw audio tensor. It is designed to be used in conjunction with a pre-encoder compatible with raw audio data, such as Sinc convolutions. The class currently does not implement padding, and there are known issues regarding output length calculation when the audio input is shorter than the specified window length. Please note that trailing values are discarded due to the lack of padding.

Sampling rate (not used currently).

Type: Optional

win_length

Length of the frame for the sliding window.

Type: int

hop_length

Relative starting point of the next frame.

Type: int

channels

Number of input channels.

Type: int

padding

Placeholder for padding (not implemented).

Type: Optional[int]
Parameters:
- win_length (int) – Length of frame (default: 400).
- hop_length (int) – Relative starting point of next frame (default: 160).
- channels (int) – Number of input channels (default: 1).
- padding (Optional *[*int ]) – Padding (currently not implemented).
- fs – Sampling rate (placeholder for compatibility, not used).

Known Issues: : - Output length is calculated incorrectly if audio is shorter than <br/> win_length.

WARNING: trailing values are discarded - padding not implemented yet.
No additional window function is applied to input values.

######### Examples

>>> sliding_window = SlidingWindow(win_length=400, hop_length=160)
>>> input_tensor = torch.randn(2, 800, 1)  # Example input
>>> input_lengths = torch.tensor([800, 800])  # Example lengths
>>> output, output_lengths = sliding_window.forward(input_tensor, input_lengths)
>>> print(output.shape)  # Should show the output shape
>>> print(output_lengths)  # Should show the output lengths

Initialize.

Parameters:
- win_length – Length of frame.
- hop_length – Relative starting point of next frame.
- channels – Number of input channels.
- padding – Padding (placeholder, currently not implemented).
- fs – Sampling rate (placeholder for compatibility, not used).

forward(input: Tensor, input_lengths: Tensor) → Tuple[Tensor, Tensor]

Apply a sliding window on the input tensor.

This method processes a batch of audio input data using a sliding window approach, which allows the model to handle continuous audio signals in manageable frames. The method outputs the windowed audio along with the corresponding lengths of the output sequences.

Parameters:
- input –
  A tensor of shape (B, T, C*D) or (B, T*C*D), where:
  - B is the batch size,
  - T is the length of the input sequence,
  - C is the number of input channels,
  - D is the window length (for the case of (B, T*C*D),
  it is assumed that D=1).
- input_lengths – A tensor of shape (B,) representing the lengths of each input sequence in the batch.
Returns: A tuple containing: : - A tensor of shape (B, T, C, D) representing the windowed output, where D is the window length.
- A tensor of shape (B,) representing the output lengths for each sequence in the batch.
Return type: Tuple[torch.Tensor, torch.Tensor]

######### Examples

>>> import torch
>>> sliding_window = SlidingWindow(win_length=400, hop_length=160)
>>> input_tensor = torch.randn(2, 800, 1)  # (B=2, T=800, C=1)
>>> input_lengths = torch.tensor([800, 800])  # Lengths for each batch
>>> output, output_lengths = sliding_window.forward(input_tensor, input_lengths)
>>> print(output.shape)  # Should output (2, num_windows, 1, 400)
>>> print(output_lengths)  # Output lengths based on input lengths

NOTE

The method currently does not apply any window function to the input values.
Trailing values may be discarded due to the absence of padding implementation.

Raises:ValueError – If the input tensor dimensions do not match the expected shape.

output_size() → int

Return the output length of the feature dimension D.

This method provides the length of the output feature dimension D, which corresponds to the defined window length used in the sliding window operation.

Returns: The length of the output feature dimension D, which is equal to the window length (win_length).
Return type: int

######### Examples

>>> sliding_window = SlidingWindow(win_length=400)
>>> sliding_window.output_size()
400