espnet2.asr.transducer.beam_search_transducer.BeamSearchTransducer

About 8 min

espnet2.asr.transducer.beam_search_transducer.BeamSearchTransducer

class espnet2.asr.transducer.beam_search_transducer.BeamSearchTransducer(decoder: AbsDecoder, joint_network: JointNetwork, beam_size: int, lm: Module | None = None, lm_weight: float = 0.1, search_type: str = 'default', max_sym_exp: int = 2, u_max: int = 50, nstep: int = 1, prefix_alpha: int = 1, expansion_gamma: int = 2.3, expansion_beta: int = 2, multi_blank_durations: List[int] = [], multi_blank_indices: List[int] = [], score_norm: bool = True, score_norm_during: bool = False, nbest: int = 1, token_list: List[str] | None = None)

Bases: object

Beam search implementation for Transducer models.

This class implements various beam search algorithms for Transducer models used in automatic speech recognition (ASR). It allows for flexible decoding strategies such as greedy search, beam search, and more advanced techniques like N-step constrained search and modified adaptive expansion search.

decoder

The decoder module used for generating predictions.

Type:AbsDecoder

joint_network

The joint network module used for combining encoder and decoder outputs.

Type:JointNetwork

beam_size

The size of the beam for beam search.

Type: int

Language model for soft fusion.

Type: torch.nn.Module, optional

_weight

Weight for the language model during decoding.

Type: float

search_type

The search algorithm to use during inference.

Type: str

max_sym_exp

Maximum number of symbol expansions at each time step.

Type: int

u_max

Maximum output sequence length.

Type: int

nstep

Number of maximum expansion steps at each time step.

Type: int

prefix_alpha

Maximum prefix length in prefix search.

Type: int

expansion_gamma

Allowed log probability difference for pruning.

Type: int

expansion_beta

Number of additional candidates for expanded hypotheses selection.

Type: int

multi_blank_durations

The duration of each blank token.

Type: List[int]

multi_blank_indices

The index of each blank token in token_list.

Type: List[int]

score_norm

Normalize final scores by length.

Type: bool

score_norm

_during

Normalize scores by length during search.

Type: bool

nbest

Number of final hypotheses.

Type: int

token_list

List of tokens used in decoding.

Type: List[str], optional
Parameters:
- decoder (AbsDecoder) – Decoder module.
- joint_network (JointNetwork) – Joint network module.
- beam_size (int) – Beam size.
- lm (torch.nn.Module , optional) – Language model class.
- lm_weight (float) – Language model weight for soft fusion.
- search_type (str) – Search algorithm to use during inference.
- max_sym_exp (int) – Maximum symbol expansions at each time step.
- u_max (int) – Maximum output sequence length.
- nstep (int) – Maximum expansion steps at each time step.
- prefix_alpha (int) – Maximum prefix length in prefix search.
- expansion_beta (int) – Additional candidates for expanded hypotheses.
- expansion_gamma (int) – Log probability difference for pruning.
- multi_blank_durations (List *[*int ]) – Duration of each blank token.
- multi_blank_indices (List *[*int ]) – Index of each blank token.
- score_norm (bool) – Normalize final scores by length.
- score_norm_during (bool) – Normalize scores by length during search.
- nbest (int) – Number of final hypotheses.
- token_list (List *[*str ] , optional) – List of tokens.
Returns: N-best decoding results, depending on the search type used.
Return type: List[Hypothesis] or List[ExtendedHypothesis]

####################### Examples

>>> beam_search = BeamSearchTransducer(decoder, joint_network,
...                                     beam_size=5)
>>> enc_out = torch.randn(10, 256)  # Example encoder output
>>> nbest_hyps = beam_search(enc_out)
>>> for hyp in nbest_hyps:
...     print(hyp.yseq, hyp.score)

############## NOTE Ensure that the chosen search_type is compatible with the provided language model, if any.

Raises:NotImplementedError – If an unsupported search type is provided.

Initialize Transducer search module.

Parameters:
- decoder – Decoder module.
- joint_network – Joint network module.
- beam_size – Beam size.
- lm – LM class.
- lm_weight – LM weight for soft fusion.
- search_type – Search algorithm to use during inference.
- max_sym_exp – Number of maximum symbol expansions at each time step. (TSD)
- u_max – Maximum output sequence length. (ALSD)
- nstep – Number of maximum expansion steps at each time step. (NSC/mAES)
- prefix_alpha – Maximum prefix length in prefix search. (NSC/mAES)
- expansion_beta – Number of additional candidates for expanded hypotheses selection. (mAES)
- expansion_gamma – Allowed logp difference for prune-by-value method. (mAES)
- multi_blank_durations – The duration of each blank token. (MBG)
- multi_blank_indices – The index of each blank token in token_list. (MBG)
- score_norm – Normalize final scores by length. (“default”)
- score_norm_during – Normalize scores by length during search. (default, TSD, ALSD)
- nbest – Number of final hypothesis.

align_length_sync_decoding(enc_out: Tensor) → List[Hypothesis]

Alignment-length synchronous beam search implementation.

This method performs an alignment-length synchronous beam search for decoding sequences from an encoder output. The search iterates through the time steps of the encoder output and maintains hypotheses that align the length of the output sequences with the input encoder output.

The implementation is based on the research presented in: https://ieeexplore.ieee.org/document/9053040

Parameters:enc_out – A tensor representing the encoder output sequences. Shape should be (T, D), where T is the number of time steps and D is the dimension of the encoder output.
Returns: A list of N-best hypotheses, sorted by score.
Return type: List[Hypothesis]

####################### Examples

>>> enc_output = torch.rand(10, 256)  # Simulated encoder output
>>> beam_search_transducer = BeamSearchTransducer(...)
>>> nbest_hyps = beam_search_transducer.align_length_sync_decoding(enc_output)
>>> for hyp in nbest_hyps:
...     print(hyp.yseq, hyp.score)

############## NOTE This method may produce a smaller number of final hypotheses if no valid sequences are found that satisfy the length alignment criteria.

default_beam_search(enc_out: Tensor) → List[Hypothesis]

Beam search implementation.

This method performs beam search decoding for Transducer models. It utilizes the encoder output to generate N-best hypotheses. The search process is based on the principles outlined in the paper: “Sequence to Sequence Learning with Neural Networks” (https://arxiv.org/pdf/1211.3711.pdf).

Parameters:enc_out – Encoder output sequence. The shape should be (T, D), where T is the time dimension and D is the feature dimension.
Returns: A list of N-best hypotheses, each represented by : a Hypothesis object containing the score, output sequence, and decoder state.
Return type: List[Hypothesis]

####################### Examples

>>> enc_output = torch.randn(10, 256)  # Example encoder output
>>> beam_search = BeamSearchTransducer(decoder, joint_network, beam_size=5)
>>> nbest_hyps = beam_search.default_beam_search(enc_output)
>>> for hyp in nbest_hyps:
...     print(hyp.yseq, hyp.score)

############## NOTE Ensure that the encoder output has been properly generated from the preceding model layers before calling this method.

greedy_search(enc_out: Tensor) → List[Hypothesis]

Greedy search implementation.

This method performs a greedy search on the given encoder output sequence. It initializes a hypothesis with the blank token and iteratively updates the hypothesis by predicting the next token at each time step based on the encoder’s output and the decoder’s scores.

Parameters:enc_out – Encoder output sequence of shape (T, D_enc), where T is the number of time steps and D_enc is the dimension of the encoder’s output.
Returns: A list containing the 1-best hypothesis found during the greedy search, which includes the score, predicted token sequence, and decoder state.
Return type: List[Hypothesis]

####################### Examples

>>> enc_out = torch.randn(10, 128)  # Example encoder output
>>> beam_search_transducer = BeamSearchTransducer(...)
>>> best_hypothesis = beam_search_transducer.greedy_search(enc_out)
>>> print(best_hypothesis[0].yseq)  # Output the predicted sequence

############## NOTE This implementation assumes that the decoder has been initialized and is ready to perform scoring.

modified_adaptive_expansion_search(enc_out: Tensor) → List[ExtendedHypothesis]

Perform modified Adaptive Expansion Search (mAES) for decoding.

This method implements the modified Adaptive Expansion Search algorithm, which expands hypotheses based on the score and allows for efficient search in the decoding process. It utilizes a prefix search strategy combined with adaptive expansion to enhance performance.

The algorithm is based on and modified from:

https://ieeexplore.ieee.org/document/9250505
N-step constrained beam search (NSC).

Parameters:enc_out – Tensor of shape (T, D_enc) representing the encoder output sequence, where T is the sequence length and D_enc is the encoder output dimension.
Returns: A list of N-best hypotheses sorted by their scores.
Return type: List[ExtendedHypothesis]

####################### Examples

>>> # Assuming `encoder_output` is a valid tensor from the encoder
>>> search = BeamSearchTransducer(...)  # Initialize with parameters
>>> best_hyps = search.modified_adaptive_expansion_search(encoder_output)
>>> for hyp in best_hyps:
>>>     print(hyp.yseq, hyp.score)

############## NOTE The number of expansions and scoring mechanisms can be adjusted via the class parameters to optimize performance based on the specific use case.

Raises:
- NotImplementedError – If the search type or configuration is not
- supported or implemented. –

multi_blank_greedy_search(enc_out: Tensor) → List[Hypothesis]

Greedy search for Multi-Blank Transducer (Multi-Blank Greedy, MBG).

This implementation assumes:

The index of the standard blank is the last entry of

self.multi_blank_indices rather than self.blank_id to minimize changes to the original transducer.

Other entries in self.multi_blank_indices represent large blanks that account for multiple frames.

The algorithm processes the encoder output and generates a sequence of hypotheses by selectively predicting tokens based on the state of the decoder and the encoder output.

Parameters:enc_out – Encoder output sequence. Shape: (T, D_enc)
Returns: A list containing the 1-best hypothesis.
Return type: List[Hypothesis]

####################### Examples

>>> enc_out = torch.randn(10, 256)  # Example encoder output
>>> search = BeamSearchTransducer(...)  # Initialize with required params
>>> best_hypothesis = search.multi_blank_greedy_search(enc_out)
>>> print(best_hypothesis[0].yseq)  # Display the predicted sequence

############## NOTE This search method is particularly useful in scenarios where the model needs to account for varying lengths of blank tokens in the output sequence.

nsc_beam_search(enc_out: Tensor) → List[ExtendedHypothesis]

N-step constrained beam search implementation.

This method performs an N-step constrained beam search, which allows for a specified number of expansions (nstep) during the decoding process. The search is based on the approach outlined in the paper: “N-step Constrained Beam Search for Sequence-to-Sequence Models”. For any usage outside ESPnet, please reference ESPnet (b-flo, PR #2444) until further modifications. Based on/Modified from https://arxiv.org/pdf/2002.03577.pdf.

Parameters:enc_out – Encoder output sequence with shape (T, D_enc), where T is the length of the sequence and D_enc is the dimension of the encoder output.
Returns: A list of N-best hypotheses sorted by their scores.
Return type: nbest_hyps

####################### Examples

>>> # Assuming enc_out is a tensor of shape (T, D_enc)
>>> nbest_hyps = beam_search_transducer.nsc_beam_search(enc_out)
>>> for hyp in nbest_hyps:
>>>     print(f"Score: {hyp.score}, Sequence: {hyp.yseq}")

############## NOTE This implementation is designed to optimize the decoding process for transducer models, providing flexibility in the number of allowed expansions and the ability to incorporate language models if available.

Raises:ValueError – If the shape of enc_out is not compatible or if nstep is less than 1.

prefix_search(hyps: List[ExtendedHypothesis], enc_out_t: Tensor) → List[ExtendedHypothesis]

Prefix search for NSC and mAES strategies.

This method performs a prefix search over a list of hypotheses to update their scores based on the current encoder output. It identifies hypotheses that share a common prefix and adjusts their scores by considering the log probabilities of extending these prefixes.

The algorithm operates as follows:

Iterate through each hypothesis and compare it with others.
Check if one hypothesis is a prefix of another and if the

difference in their lengths is within the allowed prefix length.

Calculate the log probability of extending the current hypothesis using the joint network and update the scores accordingly.

Parameters:
- hyps – A list of ExtendedHypothesis instances representing the current hypotheses.
- enc_out_t – The encoder output for the current time step.
Returns: The updated list of hypotheses with adjusted scores.
Return type: List[ExtendedHypothesis]

####################### Examples

>>> enc_out_t = torch.randn(1, 256)  # Example encoder output
>>> hyp1 = ExtendedHypothesis(yseq=[1, 2], score=0.5, dec_out=[...])
>>> hyp2 = ExtendedHypothesis(yseq=[1, 2, 3], score=0.6, dec_out=[...])
>>> hyps = [hyp1, hyp2]
>>> updated_hyps = prefix_search(hyps, enc_out_t)

############## NOTE This method is particularly useful in the context of N-step constrained beam search (NSC) and modified adaptive expansion search (mAES) strategies, where maintaining and updating prefixes efficiently can lead to better decoding performance.

sort_nbest(hyps: List[Hypothesis] | List[ExtendedHypothesis]) → List[Hypothesis] | List[ExtendedHypothesis]

Sort hypotheses by score or score given sequence length.

This method sorts the provided list of hypotheses based on their scores. It can normalize the scores by the length of the sequences if specified.

Parameters:hyps – A list of hypotheses to be sorted. This can be a list of either Hypothesis or ExtendedHypothesis objects.
Returns: A list of the top N best hypotheses, sorted in descending order of score (or normalized score if applicable).

####################### Examples

>>> hyp1 = Hypothesis(score=10.0, yseq=[1, 2, 3], dec_state=None)
>>> hyp2 = Hypothesis(score=12.0, yseq=[1, 2], dec_state=None)
>>> hyp3 = Hypothesis(score=9.0, yseq=[1, 2, 3, 4], dec_state=None)
>>> sorted_hyps = sort_nbest([hyp1, hyp2, hyp3])
>>> print([hyp.score for hyp in sorted_hyps])
[12.0, 10.0, 9.0]

############## NOTE If score_norm is set to True, the scores will be divided by the length of the sequences before sorting.

time_sync_decoding(enc_out: Tensor) → List[Hypothesis]

Time synchronous beam search implementation.

This method performs a time synchronous beam search over the encoder output sequence. It maintains a beam of hypotheses and expands them at each time step based on the scores from the joint network. The search is guided by the beam size and considers the language model if provided.

This implementation is based on the work found in: https://ieeexplore.ieee.org/document/9053040

Parameters:enc_out – Encoder output sequence. Shape (T, D), where T is the length of the sequence and D is the dimension of the encoder output.
Returns: A list of N-best hypotheses sorted by score, each : represented as an instance of the Hypothesis class.
Return type: nbest_hyps

####################### Examples

>>> # Example usage of the time_sync_decoding method
>>> beam_search = BeamSearchTransducer(decoder, joint_network, beam_size=5)
>>> encoder_output = torch.randn(10, 256)  # Example encoder output
>>> results = beam_search.time_sync_decoding(encoder_output)
>>> for hyp in results:
>>>     print(hyp.yseq, hyp.score)

############## NOTE Ensure that the decoder and joint_network have been properly initialized before calling this method.