espnet2.asr.encoder.wav2vec2_encoder.FairSeqWav2Vec2Encoder
class espnet2.asr.encoder.wav2vec2_encoder.FairSeqWav2Vec2Encoder(input_size: int, w2v_url: str, w2v_dir_path: str = './', output_size: int = 256, normalize_before: bool = False, freeze_finetune_updates: int = 0)
Bases: AbsEncoder
FairSeq Wav2Vec2 encoder module for automatic speech recognition.
This encoder uses a pre-trained Wav2Vec 2.0 model from FairSeq to extract features from raw audio input, and can be fine-tuned for downstream speech recognition tasks.
- Parameters:
- input_size (int) – Input dimension for the encoder.
- w2v_url (str) – URL to the Wav2Vec2.0 pretrained model.
- w2v_dir_path (str , optional) – Directory to download the Wav2Vec2.0 pretrained model. Defaults to “./”.
- output_size (int , optional) – Dimension of the output features after encoding. Defaults to 256.
- normalize_before (bool , optional) – Whether to apply layer normalization before the first block. Defaults to False.
- freeze_finetune_updates (int, optional) – Number of training updates during which the encoder parameters stay frozen; fine-tuning begins once this many updates have elapsed. Defaults to 0.
encoders
The loaded Wav2Vec2 model used for encoding.
pretrained_params
A copy of the pretrained model’s parameters for reloading.
output_layer
An optional linear layer to adjust output dimensions.
normalize_before
A flag indicating whether normalization is applied before encoding.
freeze_finetune_updates
The threshold for starting fine-tuning.
Examples
>>> encoder = FairSeqWav2Vec2Encoder(
... input_size=161,
... w2v_url="https://path/to/wav2vec2/model",
... output_size=256,
... )
>>> xs_pad = torch.randn(10, 100, 161) # (B, L, D)
>>> ilens = torch.tensor([100] * 10) # Input lengths
>>> output, olens, _ = encoder(xs_pad, ilens)
NOTE: Ensure that the FairSeq library is installed. You can install it with: cd ${MAIN_ROOT}/tools && make fairseq.done
- Raises: Exception – If the FairSeq library is not installed or the downloaded model class is not compatible.
forward(xs_pad: Tensor, ilens: Tensor, prev_states: Tensor | None = None) → Tuple[Tensor, Tensor, Tensor | None]
Forward pass through the FairSeq Wav2Vec2 Encoder.
This method takes padded input tensors and their lengths, processes them through the Wav2Vec2 encoder, and returns the encoded output along with the corresponding output lengths and an optional tensor.
- Parameters:
- xs_pad (torch.Tensor) – Input tensor of shape (B, L, D), where B is the batch size, L is the sequence length, and D is the feature dimension.
- ilens (torch.Tensor) – Tensor of shape (B,) representing the lengths of each input sequence in the batch.
- prev_states (torch.Tensor , optional) – Previous states (not used in this implementation). Defaults to None.
- Returns:
- A tensor containing the position-embedded output of shape (B, T, C), where T is the output sequence length and C is the output dimension.
- A tensor of shape (B,) representing the lengths of the output sequences.
- An optional tensor (currently None).
- Return type: Tuple[torch.Tensor, torch.Tensor, Optional[torch.Tensor]]
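The output length T is much smaller than the input length L because the wav2vec2 convolutional feature extractor downsamples the raw waveform (roughly 320x). A minimal sketch of the length arithmetic, assuming the default wav2vec2-base conv stack from FairSeq (these kernel/stride values are the upstream defaults, not read from this encoder):

```python
def conv_out_length(length: int, kernel: int, stride: int) -> int:
    # Standard output-length formula for a 1-D convolution with no padding.
    return (length - kernel) // stride + 1

# Default wav2vec2-base feature-extractor stack as (kernel, stride) pairs.
conv_layers = [(10, 5)] + [(3, 2)] * 4 + [(2, 2)] * 2

length = 16000  # one second of 16 kHz audio
for kernel, stride in conv_layers:
    length = conv_out_length(length, kernel, stride)

print(length)  # 49 frames, i.e. about one frame per 20 ms
```

This is only the length bookkeeping; the actual olens returned by forward() are derived from the padding mask produced inside the wav2vec2 model.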
Examples
>>> encoder = FairSeqWav2Vec2Encoder(input_size=128, w2v_url="url")
>>> xs_pad = torch.randn(2, 100, 128) # Example input
>>> ilens = torch.tensor([100, 80]) # Example lengths
>>> output, olens, _ = encoder.forward(xs_pad, ilens)
>>> print(output.shape) # Should print: torch.Size([2, T, output_size])
>>> print(olens) # Should print the lengths of the output sequences
NOTE: The method automatically handles freezing and fine-tuning of the encoder parameters based on the number of training updates.
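The update-based gating mentioned in the note can be sketched as follows. This is a toy illustration of the freeze_finetune_updates idea, not the actual ESPnet internals: an update counter is incremented on each training-mode forward pass, and the wav2vec2 parameters only start receiving gradients once the counter reaches the threshold.

```python
class FinetuneGate:
    """Toy model of the freeze_finetune_updates counter (illustrative only)."""

    def __init__(self, freeze_finetune_updates: int):
        self.freeze_finetune_updates = freeze_finetune_updates
        self.num_updates = 0

    def step(self) -> bool:
        # True once the encoder may be fine-tuned; until then the
        # wav2vec2 forward pass would be run under torch.no_grad().
        finetune = self.num_updates >= self.freeze_finetune_updates
        self.num_updates += 1
        return finetune

gate = FinetuneGate(freeze_finetune_updates=2)
print([gate.step() for _ in range(4)])  # [False, False, True, True]
```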
- Raises: RuntimeError – If the encoder fails to process the input due to incompatible dimensions or other issues.
output_size() → int
Get the output size of the encoder.
This method returns the dimension of the output produced by the encoder. It is particularly useful for understanding the shape of the features that will be passed to subsequent layers in a neural network.
- Returns: The output size of the encoder.
- Return type: int
Examples
>>> encoder = FairSeqWav2Vec2Encoder(input_size=512, w2v_url="url", output_size=256)
>>> size = encoder.output_size()
>>> print(size)  # Output: 256
reload_pretrained_parameters()
Reload the pretrained parameters into the encoder.
This method loads the parameters that were initially stored in the pretrained_params attribute back into the encoder model. This is useful for restoring the original state of the model after fine-tuning or modifications have been made.
Examples
>>> # Create an instance of the encoder
>>> encoder = FairSeqWav2Vec2Encoder(
...     input_size=256,
...     w2v_url='https://example.com/wav2vec2_model',
...     w2v_dir_path='./models',
... )
>>> # Fine-tune the encoder (hypothetical fine-tuning code here)
>>> # ...
>>> # Reload the original pretrained parameters
>>> encoder.reload_pretrained_parameters()
NOTE: This method returns no value; it logs a message indicating that the pretrained parameters have been reloaded.
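The mechanism behind this method follows a common PyTorch pattern: snapshot the model's state_dict at construction time, then restore it later with load_state_dict(). A minimal sketch using a toy nn.Linear as a stand-in for the wrapped wav2vec2 model (illustrating the pattern, not the real encoder internals):

```python
import copy

import torch
import torch.nn as nn

# Toy stand-in for the wrapped wav2vec2 model.
model = nn.Linear(4, 4)

# Snapshot the parameters right after loading (cf. pretrained_params).
pretrained_params = copy.deepcopy(model.state_dict())

with torch.no_grad():
    model.weight.add_(1.0)  # pretend fine-tuning changed the weights

# Analogous to reload_pretrained_parameters(): restore the snapshot.
model.load_state_dict(pretrained_params)
print(torch.equal(model.weight, pretrained_params["weight"]))  # True
```

The deep copy matters: state_dict() returns references to the live tensors, so without copying, fine-tuning would silently mutate the snapshot as well.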