espnet2.asr.transducer.beam_search_transducer_streaming.BeamSearchTransducerStreaming
class espnet2.asr.transducer.beam_search_transducer_streaming.BeamSearchTransducerStreaming(decoder: AbsDecoder, joint_network: JointNetwork, beam_size: int, lm: Module | None = None, lm_weight: float = 0.1, search_type: str = 'default', max_sym_exp: int = 2, u_max: int = 50, nstep: int = 1, prefix_alpha: int = 1, expansion_gamma: int = 2.3, expansion_beta: int = 2, score_norm: bool = True, score_norm_during: bool = False, nbest: int = 1, penalty: float = 0.0, token_list: List[str] | None = None, hold_n: int = 0)
Bases: object
Beam search implementation for Transducer models.
This class performs beam search decoding for Transducer models, leveraging various search strategies including greedy, time-synchronous, and constrained beam search methods. It integrates an optional language model for enhanced decoding performance and allows customization of multiple parameters to tailor the search behavior.
decoder
An instance of AbsDecoder used for generating predictions.
joint_network
An instance of JointNetwork used for joint decoding.
beam_size
The number of hypotheses to maintain during search.
hidden_size
The size of the hidden states in the decoder.
vocab_size
The size of the vocabulary.
sos
The start-of-sequence token ID.
token_list
An optional list of tokens for decoding output.
blank_id
The ID of the blank token used in Transducer models.
penalty
The penalty applied during decoding to adjust scores.
search_algorithm
The selected search algorithm for decoding.
use_lm
A boolean indicating if a language model is used.
lm
The language model used for scoring hypotheses.
lm_weight
Weighting factor for the language model’s contribution.
score_norm
A boolean indicating whether to normalize scores.
score_norm_during
A boolean indicating if scores should be normalized during search.
nbest
The number of best hypotheses to return.
hold_n
The number of tokens to hold for incremental decoding.
- Parameters:
- decoder – Decoder module.
- joint_network – Joint network module.
- beam_size – Beam size.
- lm – Language model class (optional).
- lm_weight – Weight for soft fusion with the language model (default: 0.1).
- search_type – Type of search algorithm to use during inference.
- max_sym_exp – Maximum number of symbol expansions at each time step (default: 2).
- u_max – Maximum output sequence length (default: 50).
- nstep – Maximum expansion steps at each time step (default: 1).
- prefix_alpha – Maximum prefix length in prefix search (default: 1).
- expansion_gamma – Log probability difference for prune-by-value method (default: 2.3).
- expansion_beta – Additional candidates for expanded hypotheses selection (default: 2).
- score_norm – Normalize final scores by length (default: True).
- score_norm_during – Normalize scores by length during search (default: False).
- nbest – Number of final hypotheses to return (default: 1).
- penalty – Penalty applied to scores (default: 0.0).
- token_list – Optional list of tokens for decoding output.
- hold_n – Number of tokens to hold for incremental decoding (default: 0).
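One plausible reading of the hold_n parameter can be illustrated with a small standalone sketch (the helper name emit_stable_tokens is hypothetical, not part of the ESPnet API): during incremental decoding, the last hold_n tokens are withheld from emission because later frames may still revise them.

```python
def emit_stable_tokens(yseq, hold_n):
    """Return the prefix of yseq that is safe to emit now,
    holding back the last hold_n tokens (they may still change)."""
    if hold_n <= 0:
        return list(yseq)
    return list(yseq[: max(0, len(yseq) - hold_n)])
```

With hold_n=2, a partial hypothesis [7, 3, 9, 2] emits only [7, 3]; the held-back suffix is released as more frames arrive.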
Example
>>> decoder = MyDecoder()
>>> joint_network = MyJointNetwork()
>>> beam_search = BeamSearchTransducerStreaming(
... decoder=decoder,
... joint_network=joint_network,
... beam_size=5,
... lm=my_language_model
... )
>>> enc_out = torch.randn(10, decoder.dunits) # Example encoder output
>>> hypotheses = beam_search(enc_out)
- Raises: NotImplementedError – If an unsupported search type or language model is provided.
NOTE: The search_type can be one of the following:
- “default”
- “greedy”
- “tsd”
- “alsd”
- “nsc”
- “maes”
Each type has its own specific behavior and performance characteristics.
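The mapping from search_type to search algorithm can be sketched as a simple dispatch table (the method names mirror those documented below; the dispatch helper itself is hypothetical, not the ESPnet implementation):

```python
SEARCH_DISPATCH = {
    "default": "default_beam_search",
    "greedy": "greedy_search",
    "tsd": "time_sync_decoding",
    "alsd": "align_length_sync_decoding",
    "nsc": "nsc_beam_search",
    "maes": "modified_adaptive_expansion_search",
}

def select_search_algorithm(obj, search_type):
    """Resolve search_type to a bound method, raising on unknown types
    (the constructor raises NotImplementedError the same way)."""
    if search_type not in SEARCH_DISPATCH:
        raise NotImplementedError(f"Unsupported search type: {search_type}")
    return getattr(obj, SEARCH_DISPATCH[search_type])
```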
Initialize Transducer search module.
- Parameters:
- decoder – Decoder module.
- joint_network – Joint network module.
- beam_size – Beam size.
- lm – LM class.
- lm_weight – LM weight for soft fusion.
- search_type – Search algorithm to use during inference.
- max_sym_exp – Number of maximum symbol expansions at each time step. (TSD)
- u_max – Maximum output sequence length. (ALSD)
- nstep – Number of maximum expansion steps at each time step. (NSC/mAES)
- prefix_alpha – Maximum prefix length in prefix search. (NSC/mAES)
- expansion_beta – Number of additional candidates for expanded hypotheses selection. (mAES)
- expansion_gamma – Allowed logp difference for prune-by-value method. (mAES)
- score_norm – Normalize final scores by length. (“default”)
- score_norm_during – Normalize scores by length during search. (default, TSD, ALSD)
- nbest – Number of final hypotheses.
align_length_sync_decoding(enc_out: Tensor) → List[Hypothesis]
Alignment-length synchronous beam search implementation.
This method implements an alignment-length synchronous beam search algorithm based on the paper available at https://ieeexplore.ieee.org/document/9053040. The algorithm is designed to handle decoding in a way that aligns the output sequence length with the input sequence length, thereby maintaining a synchronous relationship during the decoding process.
- Parameters: enc_out – Encoder output sequences. (T, D), where T is the number of time steps and D is the dimension of the encoder output.
- Returns: N-best hypotheses, sorted by score, where each hypothesis contains a sequence of predicted tokens and the associated score.
- Return type: List[Hypothesis]
Example
>>> model = BeamSearchTransducerStreaming(...)
>>> enc_output = torch.randn(100, model.hidden_size)
>>> nbest_hyps = model.align_length_sync_decoding(enc_output)
>>> for hyp in nbest_hyps:
... print(hyp.yseq, hyp.score)
NOTE: The method utilizes a maximum output length (u_max) to control the number of tokens that can be generated, and it processes the encoder outputs in a way that aligns with the length of the predicted sequences.
default_beam_search(enc_out: Tensor) → List[Hypothesis]
Beam search implementation.
This method performs a standard beam search decoding algorithm for transducer models, where the search explores multiple possible hypotheses at each decoding step. The algorithm retains the top scoring hypotheses for further expansion while discarding less promising candidates.
The implementation is inspired by the method described in the paper “Sequence Transduction with Recurrent Neural Networks” (https://arxiv.org/pdf/1211.3711.pdf).
- Parameters: enc_out – Encoder output sequence of shape (T, D), where T is the number of time steps and D is the dimensionality of the encoder output.
- Returns: A list of N-best hypotheses generated from the beam search process, sorted by their scores in descending order.
- Return type: List[Hypothesis]
Example
>>> model = BeamSearchTransducerStreaming(decoder, joint_network,
... beam_size=5)
>>> encoder_output = torch.rand(10, model.hidden_size)
>>> results = model.default_beam_search(encoder_output)
>>> for hyp in results:
... print(hyp.yseq, hyp.score)
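The core pruning step of the beam search — keep only the beam_size highest-scoring hypotheses at each expansion — can be sketched in isolation. The Hyp dataclass below is a minimal stand-in for ESPnet's Hypothesis, not the real class:

```python
import heapq
from dataclasses import dataclass
from typing import List

@dataclass
class Hyp:
    score: float
    yseq: List[int]

def keep_topk(hyps: List[Hyp], beam_size: int) -> List[Hyp]:
    """Retain the beam_size best-scoring hypotheses, best first."""
    return heapq.nlargest(beam_size, hyps, key=lambda h: h.score)
```

Less promising candidates are discarded at every step, bounding the search frontier to beam_size hypotheses.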
greedy_search(enc_out: Tensor) → List[Hypothesis]
Greedy search implementation for sequence decoding.
This method performs a greedy search on the encoder output to find the most likely sequence of hypotheses based on the given transducer model. The algorithm iteratively selects the token with the highest probability at each time step until the end of the sequence is reached.
- Parameters: enc_out – A tensor representing the encoder output sequence of shape (T, D_enc), where T is the number of time steps and D_enc is the dimension of the encoder output.
- Returns: A list containing the single best hypothesis, which includes the score and the sequence of tokens predicted.
- Return type: List[Hypothesis]
Example
>>> enc_output = torch.randn(10, 128) # Example encoder output
>>> transducer = BeamSearchTransducerStreaming(...) # Initialized object
>>> best_hypothesis = transducer.greedy_search(enc_output)
>>> print(best_hypothesis[0].yseq) # Output the predicted sequence
NOTE: This method assumes that the decoder has been properly initialized and that the encoder output is valid. The output will always contain a single hypothesis, which represents the greedy choice made at each decoding step.
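A self-contained sketch of transducer-style greedy decoding, under simplifying assumptions (at most one symbol per frame, and a toy joint function supplied by the caller — this is illustrative, not the actual ESPnet implementation):

```python
import numpy as np

def greedy_decode(enc_out, joint, blank_id=0):
    """At each frame, take the argmax of the joint output and emit it
    unless it is the blank symbol (one symbol per frame, simplified)."""
    yseq = [blank_id]  # seed with blank, as the real search does
    for t in range(enc_out.shape[0]):
        logp = joint(enc_out[t], yseq[-1])
        k = int(np.argmax(logp))
        if k != blank_id:
            yseq.append(k)
    return yseq[1:]  # drop the leading blank
```

Note that a full transducer greedy search may emit several symbols per frame; this sketch caps it at one for brevity.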
modified_adaptive_expansion_search(enc_out: Tensor) → List[ExtendedHypothesis]
Perform the modified Adaptive Expansion Search (mAES) for decoding.
This method implements the modified Adaptive Expansion Search algorithm, which is based on the work presented in https://ieeexplore.ieee.org/document/9250505 and incorporates elements from the N-step Constrained beam search (NSC).
- Parameters: enc_out – Encoder output sequence. Shape (T, D_enc), where T is the number of time steps and D_enc is the dimension of the encoder output.
- Returns: A list of the N-best hypotheses generated by the search algorithm, each represented as an ExtendedHypothesis instance.
- Return type: List[ExtendedHypothesis]
Example
>>> enc_out = torch.randn(10, 256) # Example encoder output
>>> beam_search = BeamSearchTransducerStreaming(...) # Initialize
>>> nbest_hyps = beam_search.modified_adaptive_expansion_search(enc_out)
>>> for hyp in nbest_hyps:
...     print(hyp.yseq, hyp.score)
NOTE: This method is designed to be used within a beam search framework and relies on a well-defined decoder and joint network to compute scores and states for hypotheses.
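The prune-by-value step governed by expansion_gamma and expansion_beta can be sketched in isolation: take the top beam_size + expansion_beta candidates by log-probability, then drop any whose log-probability falls more than expansion_gamma below the best. This standalone helper is illustrative, not the ESPnet code:

```python
import numpy as np

def prune_by_value(logp, beam_size, expansion_beta, expansion_gamma):
    """Select candidate token ids: top-(beam_size + expansion_beta)
    by log-prob, keeping only those within expansion_gamma of the best."""
    k = beam_size + expansion_beta
    idx = np.argsort(logp)[::-1][:k]  # best-first candidate indices
    best = logp[idx[0]]
    return [int(i) for i in idx if best - logp[i] <= expansion_gamma]
```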
nsc_beam_search(enc_out: Tensor) → List[ExtendedHypothesis]
N-step constrained beam search implementation.
This method performs N-step constrained beam search on the input encoder output sequence. It is designed to search efficiently for the best hypotheses by constraining the number of expansion steps and utilizing prefix search techniques. The algorithm is based on, and modified from, https://arxiv.org/pdf/2002.03577.pdf. For any usage outside ESPnet, please reference ESPnet (b-flo, PR #2444) until further modifications are made.
- Parameters: enc_out – Encoder output sequence. Shape is (T, D_enc), where T is the length of the sequence and D_enc is the dimensionality of the encoder output.
- Returns: A list of N-best hypotheses sorted by score. Each hypothesis includes the score, the sequence of tokens generated, the decoder state, and any associated language model scores.
- Return type: List[ExtendedHypothesis]
Example
>>> # Assuming enc_out is a tensor of appropriate shape
>>> decoder = ... # Initialize your decoder
>>> joint_network = ... # Initialize your joint network
>>> beam_search = BeamSearchTransducerStreaming(decoder, joint_network, beam_size=5)
>>> nbest_hyps = beam_search.nsc_beam_search(enc_out)
>>> for hyp in nbest_hyps:
...     print(hyp.yseq, hyp.score)
NOTE: This implementation may require a language model (LM) for scoring, which can be provided during initialization of the BeamSearchTransducerStreaming class.
- Raises: NotImplementedError – If an unsupported language model type is used during initialization.
prefix_search(hyps: List[ExtendedHypothesis], enc_out_t: Tensor) → List[ExtendedHypothesis]
Prefix search for NSC and mAES strategies.
This method performs a prefix search among the given hypotheses to update their scores based on the encoder output at the current time step. The search is designed to be efficient by leveraging the prefix nature of the hypotheses, allowing for effective pruning and score adjustment.
- Parameters:
- hyps – A list of ExtendedHypothesis objects representing the current hypotheses.
- enc_out_t – The encoder output tensor for the current time step, shaped (D_enc,).
- Returns: A list of ExtendedHypothesis objects with updated scores.
Example
>>> beam_search = BeamSearchTransducerStreaming(...)
>>> current_hyps = [...] # List of ExtendedHypothesis objects
>>> enc_out_t = torch.tensor([...]) # Current encoder output
>>> updated_hyps = beam_search.prefix_search(current_hyps, enc_out_t)
NOTE: This implementation is based on the methodology described in the paper: https://arxiv.org/pdf/1211.3711.pdf
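The core test behind prefix search can be written standalone (a hypothetical helper, not the ESPnet function): one hypothesis's label sequence must be a strict prefix of another's before its score can be merged, with prefix_alpha bounding how large the length gap may be.

```python
def is_prefix(x, pref):
    """True if pref is a strict prefix of label sequence x."""
    if len(pref) >= len(x):
        return False
    return x[: len(pref)] == pref
```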
reset()
Reset the beam search state.
This method initializes the beam search state by resetting the hypotheses and beam states. It sets the initial hypotheses with the blank token and a score of zero, and initializes the decoder’s state for the beam size.
beam
The effective beam size, constrained by the vocabulary size.
- Type: int
beam_state
The initial decoder state for the beam.
B
The list of hypotheses to keep track of during the search process.
- Type: List[Hypothesis]
cache
A cache for storing intermediate results.
- Type: dict
Example
>>> beam_search_transducer = BeamSearchTransducerStreaming(...)
>>> beam_search_transducer.reset()
>>> print(beam_search_transducer.B) # Should show initial hypotheses
NOTE: This method is called at the beginning of each decoding session and after final results are obtained to prepare for the next decoding task.
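The reset pattern — reinitialize B with a single blank-seeded hypothesis and clear the cache between streaming sessions — can be sketched with a toy class (illustrative only, not the ESPnet class):

```python
class ToyStreamingSearch:
    """Minimal sketch of the reset lifecycle."""

    def __init__(self, blank_id=0):
        self.blank_id = blank_id
        self.reset()

    def reset(self):
        # One initial hypothesis: blank token, score zero.
        self.B = [{"score": 0.0, "yseq": [self.blank_id]}]
        # Clear cached intermediate results.
        self.cache = {}
```

A typical session calls reset() after collecting final results, so stale hypotheses and cached states never leak into the next utterance.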
sort_nbest(hyps: List[Hypothesis] | List[ExtendedHypothesis]) → List[Hypothesis] | List[ExtendedHypothesis]
Sort hypotheses by score or score given sequence length.
This method sorts the provided hypotheses based on their scores. The sorting can be done in two ways:
- By the raw score (if score_norm is False).
- By the score normalized by the length of the sequence (if score_norm is True).
- Parameters: hyps – A list of hypotheses to be sorted. The hypotheses can be of type Hypothesis or ExtendedHypothesis.
- Returns: A list of the top nbest sorted hypotheses.
Example
>>> beam_search = BeamSearchTransducerStreaming(...)
>>> hyps = [Hypothesis(score=5.0, yseq=[1, 2], dec_state=...),
... Hypothesis(score=3.0, yseq=[1, 3], dec_state=...)]
>>> sorted_hyps = beam_search.sort_nbest(hyps)
>>> print(sorted_hyps)
[Hypothesis(score=5.0, yseq=[1, 2], dec_state=...),
Hypothesis(score=3.0, yseq=[1, 3], dec_state=...)]
NOTE: The sorting will keep only the top nbest hypotheses in the returned list.
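The two sorting modes can be reproduced with a standalone sketch (the Hyp dataclass is a stand-in for ESPnet's Hypothesis):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Hyp:
    score: float
    yseq: List[int]

def sort_nbest(hyps: List[Hyp], nbest: int, score_norm: bool) -> List[Hyp]:
    """Sort by raw score, or by score divided by sequence length."""
    key = (lambda h: h.score / len(h.yseq)) if score_norm else (lambda h: h.score)
    return sorted(hyps, key=key, reverse=True)[:nbest]
```

Length normalization matters because raw log-probability scores penalize longer sequences: a long hypothesis with a good per-token score can lose to a short one under raw sorting.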
time_sync_decoding(enc_out: Tensor) → List[Hypothesis]
Time synchronous beam search implementation.
This method performs a time-synchronous beam search decoding using the encoder output. It generates N-best hypotheses based on the provided encoder output sequence. The algorithm follows the decoding strategies described in “Alignment-Length Synchronous Decoding for RNN Transducer” (https://ieeexplore.ieee.org/document/9053040).
- Parameters: enc_out – A tensor representing the encoder output sequence with shape (T, D), where T is the number of time steps and D is the dimensionality of the output.
- Returns: A list containing the N-best hypotheses, where each hypothesis includes a sequence of predicted tokens and their associated scores.
- Return type: List[Hypothesis]
Example
>>> encoder_output = torch.randn(10, 256) # Example encoder output
>>> decoder = BeamSearchTransducerStreaming(...) # Initialize the decoder
>>> hypotheses = decoder.time_sync_decoding(encoder_output)
>>> for hyp in hypotheses:
...     print(f'Score: {hyp.score}, Sequence: {hyp.yseq}')
NOTE: The method supports language model integration if a language model is provided during the initialization of the decoder. The scoring incorporates language model scores based on the specified parameters.
- Raises: ValueError – If the encoder output tensor does not conform to the expected shape or dimensions.