espnet2.speechlm.module.valle.ValleNARDecoder
class espnet2.speechlm.module.valle.ValleNARDecoder(n_level: int, n_ctx: int, n_state: int, n_head: int, n_layer: int, causal: bool = True, layer_class=<class 'espnet2.speechlm.module.valle.ResidualAttentionBlockAdaLM'>)
Bases: TransformerDecoder
ValleNARDecoder is a non-autoregressive Transformer decoder designed for
speech processing tasks. It applies a stack of residual attention blocks that use adaptive layer normalization (AdaLN), conditioning each block on a learned embedding of the current codec level.
level_emb
Embedding layer for level inputs.
- Type: nn.Embedding
ln
Adaptive layer normalization layer.
- Type: AdaLN
Parameters:
- n_level (int) – Number of different levels for the input.
- n_ctx (int) – Context size for the input sequences.
- n_state (int) – Dimensionality of the model’s hidden states.
- n_head (int) – Number of attention heads in each layer.
- n_layer (int) – Total number of layers in the decoder.
- causal (bool, optional) – Whether to use causal masking (default is True).
- layer_class (type, optional) – Class for the residual attention block (default is ResidualAttentionBlockAdaLM).
Returns: The output tensor after processing through the decoder.
Return type: Tensor
####### Examples
>>> decoder = ValleNARDecoder(n_level=10, n_ctx=20, n_state=512,
... n_head=8, n_layer=6)
>>> x = torch.randn(1, 20, 512) # Batch size of 1, sequence length 20
>>> level = torch.randint(0, 10, (1,))  # Level index for the batch
>>> output = decoder(x, level)
NOTE
This decoder is particularly suited for tasks in speech language modeling and is part of the ESPnet2 framework.
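To illustrate how the level embedding and AdaLN interact, here is a minimal, hypothetical sketch of adaptive layer normalization: a per-level embedding produces a scale and bias that modulate an affine-free LayerNorm. The class and attribute names below (AdaLNSketch, scale, bias) are illustrative assumptions, not the actual ESPnet implementation.

```python
import torch
import torch.nn as nn


class AdaLNSketch(nn.Module):
    """Hypothetical AdaLN: LayerNorm modulated by a per-level scale and bias."""

    def __init__(self, n_level: int, n_state: int):
        super().__init__()
        # Affine-free LayerNorm; the affine part comes from the level embedding.
        self.ln = nn.LayerNorm(n_state, elementwise_affine=False)
        self.scale = nn.Embedding(n_level, n_state)
        self.bias = nn.Embedding(n_level, n_state)
        nn.init.ones_(self.scale.weight)   # identity modulation at init
        nn.init.zeros_(self.bias.weight)

    def forward(self, x: torch.Tensor, level: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_state); level: (batch,)
        s = self.scale(level).unsqueeze(1)  # (batch, 1, n_state), broadcasts over time
        b = self.bias(level).unsqueeze(1)
        return self.ln(x) * s + b


x = torch.randn(2, 20, 256)
level = torch.tensor([0, 3])
out = AdaLNSketch(n_level=8, n_state=256)(x, level)
print(out.shape)  # torch.Size([2, 20, 256])
```

The key design point is that all levels share one set of Transformer weights; only the normalization statistics are re-scaled per level, which keeps the NAR decoder compact.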
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x: Tensor, level: Tensor, kv_cache: dict | None = None)
Forward pass for the ValleNARDecoder class.
This method processes the input tensor x through a series of attention blocks and layer normalization, incorporating positional embeddings and level embeddings. It can optionally utilize a key-value cache for efficient decoding in scenarios such as autoregressive generation.
- Parameters:
- x (Tensor) – The input tensor of shape (batch_size, sequence_length, n_state).
- level (Tensor) – The level indices tensor of shape (batch_size,).
- kv_cache (Optional[dict], optional) – A dictionary containing cached key-value pairs for cross-attention. Defaults to None.
- Returns: The output tensor after processing through the decoder layers, of shape (batch_size, sequence_length, n_state).
- Return type: Tensor
####### Examples
>>> decoder = ValleNARDecoder(n_level=10, n_ctx=512, n_state=256,
... n_head=8, n_layer=6)
>>> input_tensor = torch.randn(2, 20, 256) # Batch of 2, sequence length of 20
>>> level_tensor = torch.tensor([0, 1]) # Two levels for the batch
>>> output = decoder(input_tensor, level_tensor)
>>> print(output.shape) # Output shape should be (2, 20, 256)
NOTE
This method assumes that the input x has already been embedded into the appropriate shape and dimensionality.
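The level argument matters because a VALL-E-style NAR decoder is typically driven one codec level at a time, each pass conditioned on the sum of the embeddings of all previously decoded levels. The loop below is a hedged, self-contained sketch of that pattern; toy_decoder is a placeholder standing in for ValleNARDecoder (the real model would be called as decoder(x, level)), and all sizes are arbitrary.

```python
import torch
import torch.nn as nn

n_level, n_state, seq_len, vocab = 4, 16, 10, 32
# One embedding table per codec level (an assumption for this sketch).
emb = nn.ModuleList(nn.Embedding(vocab, n_state) for _ in range(n_level))
toy_decoder = nn.Linear(n_state, vocab)  # placeholder for the Transformer stack

tokens = torch.randint(0, vocab, (1, seq_len))  # level-0 codes from the AR stage
x = emb[0](tokens)                              # running sum of level embeddings
decoded = [tokens]
for lvl in range(1, n_level):
    level = torch.tensor([lvl])        # shape (batch,), as forward() expects
    logits = toy_decoder(x)            # real model: decoder(x, level)
    next_tokens = logits.argmax(dim=-1)  # greedy pick for this level
    decoded.append(next_tokens)
    x = x + emb[lvl](next_tokens)      # condition the next pass on this level

print(len(decoded), decoded[0].shape)  # 4 levels, each (1, 10)
```

Because every level is emitted for the whole sequence in one forward pass, decoding takes n_level passes regardless of sequence length, which is the main speed advantage of the NAR stage.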