espnet2.asr.frontend.fused.FusedFrontends
class espnet2.asr.frontend.fused.FusedFrontends(frontends=None, align_method='linear_projection', proj_dim=100, fs=16000)
Bases: AbsFrontend
A class to fuse multiple audio frontends for feature extraction.
This class combines multiple audio frontends, such as DefaultFrontend and S3prlFrontend, into a single module. It allows for the alignment and projection of features extracted from these frontends using a specified method. Currently, only linear projection is supported for fusing the frontends.
align_method
The method used for aligning features. Currently, only “linear_projection” is supported.
- Type: str
proj_dim
The dimension of the projection applied to each frontend’s output.
- Type: int
frontends
A list of frontends to combine.
- Type: ModuleList
gcd
The greatest common divisor of the hop lengths of the frontends.
- Type: int
factors
The factors for reshaping the output based on hop lengths.
- Type: list
projection_layers
A list of linear layers for projecting frontend outputs.
- Type: ModuleList
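The relationship between the gcd and factors attributes can be illustrated with plain Python. This is a minimal sketch, not the class's code: the hop lengths are hypothetical values, and the definition of each factor as the frontend's hop length divided by the gcd is an assumption based on the attribute descriptions above.

```python
from functools import reduce
from math import gcd

# Hypothetical hop lengths (in samples) for two frontends; not taken from this page.
hop_lengths = [128, 320]

# Greatest common divisor across all hop lengths.
common_hop = reduce(gcd, hop_lengths)

# Assumed definition: how many steps of size `common_hop` fit in each frontend's hop.
factors = [hop // common_hop for hop in hop_lengths]

print(common_hop)  # 64
print(factors)     # [2, 5]
```

Under this reading, a frontend with a larger hop length produces fewer frames, and its factor says how many finer frames each of its frames would have to be split into to match the common rate.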
Parameters:
- frontends (list) – A list of dictionaries specifying the frontends to combine. Each dictionary gives the frontend type under the "frontend_type" key (for example "default" or "s3prl") together with that frontend's own parameters.
- align_method (str, optional) – The alignment method for feature fusion. Defaults to “linear_projection”.
- proj_dim (int, optional) – The dimension for projection. Defaults to 100.
- fs (int, optional) – The sampling frequency. Defaults to 16000.
Returns: The fused feature tensor and the lengths of the features.
Return type: Tuple[torch.Tensor, torch.Tensor]
Raises: NotImplementedError – If an unsupported frontend type is provided or if an unsupported alignment method is specified.
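Because an unsupported frontend type raises NotImplementedError, it can be convenient to check a configuration list before constructing the class. The sketch below is illustrative only: check_frontend_configs is a hypothetical helper, not part of ESPnet, and the set of supported types is assumed from the values that appear on this page.

```python
# Frontend types that appear on this page; treated here as the supported set (assumption).
SUPPORTED_FRONTEND_TYPES = {"default", "s3prl"}


def check_frontend_configs(frontends):
    """Hypothetical helper: validate configs before passing them to FusedFrontends."""
    for index, conf in enumerate(frontends):
        frontend_type = conf.get("frontend_type")
        if frontend_type not in SUPPORTED_FRONTEND_TYPES:
            raise NotImplementedError(
                f"frontends[{index}]: unsupported frontend_type {frontend_type!r}"
            )
    return frontends


configs = check_frontend_configs(
    [{"frontend_type": "default", "n_mels": 80, "n_fft": 512}]
)
```

Reporting the index of the offending entry makes a misconfigured list easier to fix than a bare exception would.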
######### Examples
>>> import torch
>>> from espnet2.asr.frontend.fused import FusedFrontends
>>> # Initialize FusedFrontends from a list of frontend configs
>>> # (only a default frontend is listed here)
>>> frontends_config = [
...     {"frontend_type": "default", "n_mels": 80, "n_fft": 512},
... ]
>>> fused_frontend = FusedFrontends(frontends=frontends_config)
>>> # Forward pass through the fused frontend
>>> input_tensor = torch.randn(10, 16000)  # batch of 10 one-second waveforms (fs=16000)
>>> input_lengths = torch.tensor([16000] * 10)  # length of each waveform in samples
>>> output_feats, output_lengths = fused_frontend(input_tensor, input_lengths)
####### NOTE
The class currently supports only the linear projection alignment method. Future implementations may add other alignment methods.
forward(input: Tensor, input_lengths: Tensor) → Tuple[Tensor, Tensor]
Computes the forward pass for the FusedFrontends class, which processes input audio through multiple frontends and aligns the output features based on the specified alignment method.
- Parameters:
  - input (torch.Tensor) – The input audio tensor of shape (batch_size, num_samples).
  - input_lengths (torch.Tensor) – A tensor containing the lengths of the input sequences, of shape (batch_size,).
- Returns: A tuple containing:
  - A tensor of fused audio features of shape (batch_size, num_frames, output_size).
  - A tensor of the lengths of the output features, of shape (batch_size,).
- Return type: Tuple[torch.Tensor, torch.Tensor]
- Raises: NotImplementedError – If the alignment method is not supported.
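To make the output shape concrete, the sketch below traces tensor shapes through one plausible reading of linear-projection alignment: project each frontend's features, reshape them to a common frame rate, trim to the shortest stream, and concatenate along the feature axis. These steps are an assumption inferred from the attribute descriptions, not a statement of the actual implementation. The function fused_output_shape and all numeric values are hypothetical.

```python
def fused_output_shape(batch_size, frame_counts, factors, proj_dim):
    """Trace shapes through an assumed project -> reshape -> truncate -> concat pipeline."""
    reshaped_frames = []
    for num_frames, factor in zip(frame_counts, factors):
        # Projection (assumed): (batch, num_frames, feat_dim)
        #   -> (batch, num_frames, factor * proj_dim)
        # Reshape (assumed): split each frame into `factor` frames of width proj_dim
        #   -> (batch, num_frames * factor, proj_dim)
        reshaped_frames.append(num_frames * factor)

    # Trim every stream to the shortest one so the frame axes match.
    common_frames = min(reshaped_frames)

    # Concatenate along the feature axis: one proj_dim-wide slice per frontend.
    return (batch_size, common_frames, len(reshaped_frames) * proj_dim)


# Hypothetical values: two frontends producing 126 and 50 frames for the same audio.
shape = fused_output_shape(
    batch_size=10, frame_counts=[126, 50], factors=[2, 5], proj_dim=100
)
print(shape)  # (10, 250, 200)
```

In this sketch the last dimension equals the number of frontends times proj_dim, which agrees with the value reported by output_size().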
######### Examples
>>> import torch
>>> from espnet2.asr.frontend.fused import FusedFrontends
>>> fused_frontend = FusedFrontends(frontends=[{"frontend_type": "default"}])
>>> input_tensor = torch.randn(10, 16000)  # batch of 10 one-second waveforms (fs=16000)
>>> input_lengths = torch.tensor([16000] * 10)  # length of each waveform in samples
>>> output_feats, output_lengths = fused_frontend.forward(input_tensor, input_lengths)
>>> print(output_feats.shape)  # expected shape: (10, num_frames, output_size)
####### NOTE
The current implementation supports only the ‘linear_projection’ alignment method. Future updates may include additional methods.
output_size() → int
Returns the feature dimension of the fused output.
The output size is the number of frontends multiplied by the projection dimension (proj_dim) specified during initialization. It is the size of the last dimension of the feature tensor returned by forward.
- Returns: The total output size of the fused frontends.
- Return type: int
######### Examples
>>> fused_frontend = FusedFrontends(
... frontends=[
... {"frontend_type": "default", "n_mels": 80},
... {"frontend_type": "s3prl", "frontend_conf": {...}},
... ],
... proj_dim=100
... )
>>> output_size = fused_frontend.output_size()
>>> print(output_size)
200 # (2 frontends * 100 proj_dim)
####### NOTE
The function assumes that the frontends attribute is properly initialized and contains valid frontend configurations.