espnet2.speechlm.module.valle.ValleNARDecoder
class espnet2.speechlm.module.valle.ValleNARDecoder(n_level: int, n_ctx: int, n_state: int, n_head: int, n_layer: int, causal: bool = True, layer_class=<class 'espnet2.speechlm.module.valle.ResidualAttentionBlockAdaLM'>)
Bases: TransformerDecoder
ValleNARDecoder is a non-autoregressive Transformer decoder designed for
speech processing tasks. It applies a stack of residual attention blocks that use adaptive layer normalization (AdaLN), conditioning each block on a learned embedding of the current codec level.
level_emb
Embedding layer for level inputs.
- Type: nn.Embedding
ln
Adaptive layer normalization layer.
- Type: AdaLN
Parameters:
- n_level (int) – Number of different levels for the input.
- n_ctx (int) – Context size for the input sequences.
- n_state (int) – Dimensionality of the model’s hidden states.
- n_head (int) – Number of attention heads in each layer.
- n_layer (int) – Total number of layers in the decoder.
- causal (bool, optional) – Whether to use causal masking (default is True).
- layer_class (type, optional) – Class for the residual attention block (default is ResidualAttentionBlockAdaLM).
Returns: The output tensor after processing through the decoder.
Return type: Tensor
####### Examples
>>> decoder = ValleNARDecoder(n_level=10, n_ctx=20, n_state=512,
... n_head=8, n_layer=6)
>>> x = torch.randn(1, 20, 512) # Batch size of 1, sequence length 20
>>> level = torch.randint(0, 10, (1,))  # Level index for the batch
>>> output = decoder(x, level)
NOTE
This decoder is particularly suited for tasks in speech language modeling and is part of the ESPnet2 framework.
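To illustrate how the level embedding and AdaLN interact, here is a minimal, hypothetical sketch of adaptive layer normalization: a per-level embedding produces a scale and bias that modulate an affine-free LayerNorm. The class and attribute names below (AdaLNSketch, scale, bias) are illustrative assumptions, not the actual ESPnet implementation.

```python
import torch
import torch.nn as nn


class AdaLNSketch(nn.Module):
    """Hypothetical AdaLN: LayerNorm modulated by a per-level scale and bias."""

    def __init__(self, n_level: int, n_state: int):
        super().__init__()
        # Affine-free LayerNorm; the affine part comes from the level embedding.
        self.ln = nn.LayerNorm(n_state, elementwise_affine=False)
        self.scale = nn.Embedding(n_level, n_state)
        self.bias = nn.Embedding(n_level, n_state)
        nn.init.ones_(self.scale.weight)   # identity modulation at init
        nn.init.zeros_(self.bias.weight)

    def forward(self, x: torch.Tensor, level: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_state); level: (batch,)
        s = self.scale(level).unsqueeze(1)  # (batch, 1, n_state), broadcasts over time
        b = self.bias(level).unsqueeze(1)
        return self.ln(x) * s + b


x = torch.randn(2, 20, 256)
level = torch.tensor([0, 3])
out = AdaLNSketch(n_level=8, n_state=256)(x, level)
print(out.shape)  # torch.Size([2, 20, 256])
```

The key design point is that all levels share one set of Transformer weights; only the normalization statistics are re-scaled per level, which keeps the NAR decoder compact.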
Initialize internal Module state, shared by both nn.Module and ScriptModule.
forward(x: Tensor, level: Tensor, kv_cache: dict | None = None)
Forward pass for the ValleNARDecoder class.
This method processes the input tensor x through a series of attention blocks and layer normalization, incorporating positional embeddings and level embeddings. It can optionally utilize a key-value cache for efficient decoding in scenarios such as autoregressive generation.
- Parameters:
- x (Tensor) – The input tensor of shape (batch_size, sequence_length, n_state).
- level (Tensor) – The level indices tensor of shape (batch_size,).
- kv_cache (Optional[dict], optional) – A dictionary containing cached key-value pairs for cross-attention. Defaults to None.
- Returns: The output tensor after processing through the decoder layers, of shape (batch_size, sequence_length, n_state).
- Return type: Tensor
####### Examples
>>> decoder = ValleNARDecoder(n_level=10, n_ctx=512, n_state=256,
... n_head=8, n_layer=6)
>>> input_tensor = torch.randn(2, 20, 256) # Batch of 2, sequence length of 20
>>> level_tensor = torch.tensor([0, 1]) # Two levels for the batch
>>> output = decoder(input_tensor, level_tensor)
>>> print(output.shape) # Output shape should be (2, 20, 256)
NOTE
This method assumes that the input x has already been embedded into the appropriate shape and dimensionality.
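The level argument matters because a VALL-E-style NAR decoder is typically driven one codec level at a time, each pass conditioned on the sum of the embeddings of all previously decoded levels. The loop below is a hedged, self-contained sketch of that pattern; toy_decoder is a placeholder standing in for ValleNARDecoder (the real model would be called as decoder(x, level)), and all sizes are arbitrary.

```python
import torch
import torch.nn as nn

n_level, n_state, seq_len, vocab = 4, 16, 10, 32
# One embedding table per codec level (an assumption for this sketch).
emb = nn.ModuleList(nn.Embedding(vocab, n_state) for _ in range(n_level))
toy_decoder = nn.Linear(n_state, vocab)  # placeholder for the Transformer stack

tokens = torch.randint(0, vocab, (1, seq_len))  # level-0 codes from the AR stage
x = emb[0](tokens)                              # running sum of level embeddings
decoded = [tokens]
for lvl in range(1, n_level):
    level = torch.tensor([lvl])        # shape (batch,), as forward() expects
    logits = toy_decoder(x)            # real model: decoder(x, level)
    next_tokens = logits.argmax(dim=-1)  # greedy pick for this level
    decoded.append(next_tokens)
    x = x + emb[lvl](next_tokens)      # condition the next pass on this level

print(len(decoded), decoded[0].shape)  # 4 levels, each (1, 10)
```

Because every level is emitted for the whole sequence in one forward pass, decoding takes n_level passes regardless of sequence length, which is the main speed advantage of the NAR stage.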