espnet2.asr.frontend.fused.FusedFrontends

class espnet2.asr.frontend.fused.FusedFrontends(frontends=None, align_method='linear_projection', proj_dim=100, fs=16000)

Bases: AbsFrontend

A class to fuse multiple audio frontends for feature extraction.

This class combines multiple audio frontends, such as DefaultFrontend and S3prlFrontend, into a single module. It allows for the alignment and projection of features extracted from these frontends using a specified method. Currently, only linear projection is supported for fusing the frontends.

align_method

The method used for aligning features. Currently, only “linear_projection” is supported.

Type: str

proj_dim

The dimension of the projection applied to each frontend’s output.

Type: int

frontends

A list of frontends to combine.

Type: ModuleList

gcd

The greatest common divisor of the hop lengths of the frontends.

Type: int

factors

The factors for reshaping the output based on hop lengths.

Type: list

projection_layers

A list of linear layers for projecting frontend outputs.

Type: ModuleList
Parameters:
- frontends (list) – A list of dictionaries specifying the frontends to combine. Each dictionary should include the type of frontend and its respective parameters.
- align_method (str, optional) – The alignment method for feature fusion. Defaults to “linear_projection”.
- proj_dim (int, optional) – The dimension for projection. Defaults to 100.
- fs (int, optional) – The sampling frequency. Defaults to 16000.
Returns: The fused feature tensor and the lengths of the features.
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: NotImplementedError – If an unsupported frontend type is provided or if an unsupported alignment method is specified.
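
The gcd and factors attributes described above can be illustrated with plain arithmetic. This is a minimal sketch, not the class's code: the hop lengths are hypothetical values, and the rule that each factor equals a frontend's hop length divided by the gcd is an assumption based on the attribute descriptions.

```python
import math
from functools import reduce

# Hypothetical hop lengths (in samples) for two frontends; illustrative only.
hop_lengths = [128, 320]

# Greatest common divisor across all hop lengths.
gcd = reduce(math.gcd, hop_lengths)

# Assumed definition: how many gcd-sized steps fit into one hop of each frontend.
factors = [hop // gcd for hop in hop_lengths]

print(gcd)      # 64
print(factors)  # [2, 5]
```

Under this assumption, a frontend with a larger hop length produces fewer frames per second and therefore gets a larger factor.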

######### Examples

Initialize FusedFrontends from a list of frontend configurations (only a default frontend is shown here):

>>> import torch
>>> from espnet2.asr.frontend.fused import FusedFrontends
>>> frontends_config = [
...     {"frontend_type": "default", "n_mels": 80, "n_fft": 512},
... ]
>>> fused_frontend = FusedFrontends(frontends=frontends_config)

Forward pass through the fused frontend:

>>> input_tensor = torch.randn(10, 16000)  # 10 waveforms of 16000 samples each
>>> input_lengths = torch.tensor([16000] * 10)  # length of each waveform in samples
>>> output_feats, output_lengths = fused_frontend(input_tensor, input_lengths)

####### NOTE The class is currently limited to using the linear projection alignment method. Future implementations may include additional alignment methods.


forward(input: Tensor, input_lengths: Tensor) → Tuple[Tensor, Tensor]

Runs the input audio through each frontend, then aligns and fuses the resulting features using the configured alignment method.

Parameters:
- input (torch.Tensor) – The input audio tensor of shape (batch_size, num_samples).
- input_lengths (torch.Tensor) – A tensor containing the lengths of the input sequences of shape (batch_size,).
Returns: A tuple containing:
- A tensor of fused audio features of shape (batch_size, num_frames, output_size).
- A tensor of the lengths of the output features of shape (batch_size,).
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: NotImplementedError – If the alignment method is not supported.
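
The input and input_lengths arguments above describe a padded batch plus the true length of each sequence. A minimal sketch of building such a pair from variable-length waveforms follows; the waveforms are random placeholders, and zero-padding via pad_sequence is one common choice rather than a requirement stated in this page.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three hypothetical mono waveforms of different lengths (in samples).
waveforms = [torch.randn(16000), torch.randn(12000), torch.randn(8000)]

# Zero-pad to the longest waveform: shape (batch_size, num_samples).
input_tensor = pad_sequence(waveforms, batch_first=True)

# True, unpadded length of each waveform: shape (batch_size,).
input_lengths = torch.tensor([w.shape[0] for w in waveforms])

print(tuple(input_tensor.shape))  # (3, 16000)
print(input_lengths.tolist())     # [16000, 12000, 8000]
```

The resulting pair has the shapes that forward expects for its two arguments.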

######### Examples

>>> import torch
>>> from espnet2.asr.frontend.fused import FusedFrontends
>>> fused_frontend = FusedFrontends(frontends=[{"frontend_type": "default"}])
>>> input_tensor = torch.randn(10, 16000)  # batch of 10 waveforms, 1 second each at 16 kHz
>>> input_lengths = torch.tensor([16000] * 10)  # length of each waveform in samples
>>> output_feats, output_lengths = fused_frontend.forward(input_tensor, input_lengths)
>>> print(output_feats.shape)  # Expected output shape: (10, num_frames, output_size)

####### NOTE The current implementation supports only the ‘linear_projection’ alignment method. Future updates may include additional methods.
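
To make the idea of linear-projection alignment concrete, here is a minimal standalone sketch in plain PyTorch. It is an illustration under stated assumptions, not the FusedFrontends implementation: the feature shapes, the factors, and the project-reshape-truncate-concatenate order are all assumed for the example.

```python
import torch

torch.manual_seed(0)
batch_size, proj_dim = 4, 100

# Hypothetical per-frontend outputs of shape (batch, frames, feature_dim).
# Assumed hop lengths 128 and 320 give gcd 64 and factors [2, 5].
factors = [2, 5]
feats = [torch.randn(batch_size, 50, 80), torch.randn(batch_size, 20, 768)]

# One linear layer per frontend, projecting to factor * proj_dim features.
projections = torch.nn.ModuleList(
    [torch.nn.Linear(f.shape[-1], k * proj_dim) for f, k in zip(feats, factors)]
)

aligned = []
for f, k, proj in zip(feats, factors, projections):
    projected = proj(f)  # (batch, frames, k * proj_dim)
    bs, nf, dim = projected.shape
    # Split every frame into k sub-frames of proj_dim features each.
    aligned.append(projected.reshape(bs, nf * k, dim // k))

# Truncate to the shortest sequence, then concatenate along the feature axis.
num_frames = min(a.shape[1] for a in aligned)
fused = torch.cat([a[:, :num_frames, :] for a in aligned], dim=-1)

print(tuple(fused.shape))  # (4, 100, 200)
```

In this sketch both frontends end up on a common frame rate (100 frames), and the last dimension equals the number of frontends times proj_dim.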

output_size() → int

Returns the feature dimension of the fused output.

The output size is the number of frontends multiplied by the projection dimension (proj_dim) specified during initialization. It matches the last dimension of the feature tensor returned by forward, so it can be used to size downstream modules.

Returns: The total output size of the fused frontends.
Return type: int

######### Examples

>>> fused_frontend = FusedFrontends(
...     frontends=[
...         {"frontend_type": "default", "n_mels": 80},
...         {"frontend_type": "s3prl", "frontend_conf": {...}},
...     ],
...     proj_dim=100
... )
>>> output_size = fused_frontend.output_size()
>>> print(output_size)
200  # (2 frontends * 100 proj_dim)

####### NOTE The function assumes that the frontends attribute is properly initialized and contains valid frontend configurations.
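
The rule described for output_size can be checked without instantiating any frontend. The helper below is hypothetical (it is not part of ESPnet) and simply restates the documented arithmetic.

```python
def fused_output_size(num_frontends: int, proj_dim: int) -> int:
    """Hypothetical helper: number of frontends times the projection dimension."""
    return num_frontends * proj_dim


print(fused_output_size(2, 100))  # 200
print(fused_output_size(3, 64))   # 192
```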