Abstract
This study proposes a context-aware dynamic pruning method for multilingual and multitask speech foundation models. Unlike conventional pruning, in which the pruned structure is fixed during training, our method prunes modules flexibly at inference time based on contextual cues such as language, speaker, and task. It reduces inference cost by up to 30% while maintaining model accuracy.
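For illustration only, and not the authors' implementation, the sketch below shows one way context-conditioned module-level pruning could be wired up: a small gate network scores each prunable module from a context embedding (e.g., language, speaker, or task), and modules whose scores fall below a threshold are skipped at inference. The names `ContextGate`, `GatedEncoder`, `context_dim`, and the 0.5 threshold are assumptions introduced for this example.

```python
# Illustrative sketch only: a hypothetical context-conditioned module gate.
# ContextGate, GatedEncoder, context_dim, and the 0.5 threshold are assumptions,
# not details taken from the paper.
import torch
import torch.nn as nn


class ContextGate(nn.Module):
    """Predicts a keep/skip score per module from a context embedding."""

    def __init__(self, context_dim: int, num_modules: int):
        super().__init__()
        self.scorer = nn.Linear(context_dim, num_modules)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # One score in (0, 1) per prunable module.
        return torch.sigmoid(self.scorer(context))


class GatedEncoder(nn.Module):
    """Layer stack in which low-scoring layers are skipped at inference."""

    def __init__(self, layers: nn.ModuleList, gate: ContextGate, threshold: float = 0.5):
        super().__init__()
        self.layers = layers
        self.gate = gate
        self.threshold = threshold

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        scores = self.gate(context)  # shape: (num_modules,)
        for layer, score in zip(self.layers, scores):
            if self.training or score >= self.threshold:
                # Scale by the gate score so the decision stays differentiable
                # during training; at inference the score acts as a soft weight.
                x = x + score * layer(x)
            # else: module pruned for this context; its computation is skipped.
        return x


if __name__ == "__main__":
    layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])
    gate = ContextGate(context_dim=8, num_modules=4)
    model = GatedEncoder(layers, gate).eval()
    x = torch.randn(2, 16)       # dummy acoustic features
    context = torch.randn(8)     # dummy language/speaker/task embedding
    with torch.no_grad():
        y = model(x, context)
    print(y.shape)
```

Skipping a module entirely, rather than merely zeroing its output, is what would yield the compute savings: pruned layers are never executed for that context.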