Comparative Analysis: Transformer vs. RNN in Speech Applications
Abstract
This paper presents an extensive comparison between the Transformer and recurrent neural networks (RNNs) across a wide range of speech tasks: automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). Surprisingly, Transformer outperforms RNN on 13 of 15 ASR benchmarks, and performs comparably or better on the ST and TTS tasks. Beyond accuracy, the paper shares practical training tips and reports that Transformer benefits more than RNN from larger mini-batches and multi-GPU training. All experiments are integrated into ESPnet with reproducible recipes.
1. Motivation: Why compare Transformer and RNN?
Transformer has revolutionized NLP, but its application to speech tasks had remained underexplored due to its computational complexity and training sensitivity.
RNNs, especially LSTMs, have long been the go-to choice for sequential speech processing due to their natural alignment with time-series data.
This paper seeks to:
- Quantify how much Transformer improves accuracy in ASR, ST, and TTS.
- Share practical training tricks.
- Provide reproducible recipes in ESPnet.
2. Architecture Overview
🔁 Recurrent Neural Networks (RNN)
- Encoder: Bi-directional LSTM (BLSTM)
- Decoder: Uni-directional LSTM with attention
- Natural for temporal data but inherently sequential, making parallelism difficult
🔀 Transformer
- Relies on self-attention instead of recurrence
- Supports full parallelization in both encoder and decoder
- Requires positional encoding to retain time structure
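Since self-attention by itself is order-invariant, time structure is injected by adding sinusoidal positional encodings to the inputs. A minimal NumPy sketch of the standard encoding (function name and shapes are illustrative, not ESPnet's implementation):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(d_model)[None, :]                  # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                     # (max_len, d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                # even dims: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                # odd dims: cosine
    return pe

# The encoding is simply added to the (scaled) input representations:
# x = embed(tokens) * sqrt(d_model) + sinusoidal_positional_encoding(T, d_model)[:T]
```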
📝 Both models share a sequence-to-sequence (S2S) structure:
EncPre → EncBody → DecPre → DecBody → DecPost
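A schematic forward pass of that shared skeleton, with placeholder module names mirroring the paper's decomposition (EncPre is, e.g., CNN subsampling for speech or an embedding layer for text; DecPost produces token posteriors for ASR/ST or spectrogram frames for TTS):

```python
# Schematic sketch of the shared S2S skeleton; the callables are placeholders,
# not ESPnet classes.
def s2s_forward(x, y_prev, enc_pre, enc_body, dec_pre, dec_body, dec_post):
    x0 = enc_pre(x)       # e.g., CNN subsampling (speech) or embedding (text)
    h = enc_body(x0)      # BLSTM stack (RNN) or self-attention stack (Transformer)
    y0 = dec_pre(y_prev)  # embed previously generated targets (teacher-forced)
    d = dec_body(y0, h)   # decoder attends over encoder states h
    return dec_post(d)    # token posteriors (ASR/ST) or spectrogram frames (TTS)
```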

3. Application to Speech Tasks
🗣️ ASR (Automatic Speech Recognition)
- Input: 83-dim log-mel filterbank + pitch features [1]
- Loss: weighted sum of the S2S (attention) and CTC losses
- Decoding combines the S2S and CTC scores, optionally with an RNN language model (RNN-LM)
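A minimal sketch of this multi-task objective (the mixing weight alpha, padding index, and tensor layouts are assumptions; ESPnet's actual code differs in detail):

```python
import torch
import torch.nn.functional as F

def joint_asr_loss(s2s_log_probs, ctc_log_probs, targets,
                   input_lengths, target_lengths, alpha=0.3):
    """Weighted sum of CTC and attention losses: L = a*L_ctc + (1-a)*L_s2s."""
    # Attention decoder: cross-entropy against target tokens (teacher forcing);
    # targets are assumed padded with -1.
    loss_s2s = F.nll_loss(
        s2s_log_probs.view(-1, s2s_log_probs.size(-1)),
        targets.view(-1),
        ignore_index=-1,
    )
    # CTC branch: ctc_loss expects (T, batch, vocab); padding is clamped away,
    # only the first target_lengths symbols per utterance are used.
    loss_ctc = F.ctc_loss(
        ctc_log_probs.transpose(0, 1),
        targets.clamp(min=0), input_lengths, target_lengths, blank=0,
    )
    return alpha * loss_ctc + (1.0 - alpha) * loss_s2s

# Decoding combines the same scores, plus an optional language model:
#   score(y) = lambda*log p_ctc(y|x) + (1-lambda)*log p_s2s(y|x) + gamma*log p_lm(y)
```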
🌐 ST (Speech Translation)
- Input: speech in one language
- Output: translated text in another
- Same structure as ASR but without CTC due to non-monotonic alignment
🗣️→📊 TTS (Text-to-Speech)
- Input: Text sequence
- Output: Mel-spectrogram sequence + EOS probability
- Loss: Combination of L1 loss and binary cross-entropy (BCE)
Training uses teacher-forcing; inference is autoregressive.
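A minimal sketch of that objective under teacher forcing (shape conventions and the bce_weight are assumptions, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

def tts_loss(pred_mel, pred_stop_logits, target_mel, target_stop, bce_weight=1.0):
    """L1 on mel-spectrogram frames + BCE on the end-of-sequence (stop) flag."""
    l1 = F.l1_loss(pred_mel, target_mel)                              # frame regression
    bce = F.binary_cross_entropy_with_logits(pred_stop_logits, target_stop)
    return l1 + bce_weight * bce

# Training feeds ground-truth frames as decoder input (teacher forcing);
# at inference the decoder consumes its own previous predictions until the
# stop probability exceeds a threshold.
```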
4. ASR: Experimental Results
✅ Benchmarks on 15 ASR tasks
- Languages: English, Japanese, Chinese, Spanish, Italian
- Conditions: clean, noisy, far-field, low-resource
- Transformer outperformed RNN on 13 out of 15 corpora
- Even without a pronunciation dictionary or alignment, Transformer matches or surpasses Kaldi in many tasks

🔧 Training Tips
- Transformer benefits greatly from large minibatches
- Accumulating gradients helps when GPUs are limited (see the sketch after this list)
- Dropout is essential for Transformer
- SpecAugment and speed perturbation improve both models
- Same decoding hyperparameters (CTC/LM weight) can be reused
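A hedged sketch of the gradient-accumulation tip above, assuming a model whose forward pass returns its loss (the loader, accum_steps, and calling convention are illustrative):

```python
import torch

def train_with_accumulation(model, optimizer, loader, accum_steps=8):
    """Emulate a large minibatch on limited GPUs by accumulating gradients.

    Gradients from several small batches are summed before a single optimizer
    step, so the effective batch size is accum_steps * per-step batch size.
    """
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        loss = model(**batch)               # assumes batch is a dict of tensors
        (loss / accum_steps).backward()     # scale so the sum matches one big batch
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```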

5. Multilingual ASR
- Single model trained on 10 languages (e.g., English, Japanese, Spanish, Mandarin)
- Output units: shared grapheme vocabulary (5,297 symbols)
- Transformer shows strong language-agnostic performance
- Achieved >10% relative improvement in 8 languages

6. Speech Translation (ST)
- Dataset: Fisher–CallHome Spanish (Spanish speech → English text)
- Transformer BLEU: 17.2 (vs RNN: 16.5)
- Reusing the ASR-pretrained encoder alleviated underfitting (see the sketch after this list)
- Transformer still requires careful training on low-resource ST tasks
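A hedged sketch of the encoder reuse mentioned above: initialize the ST encoder from an ASR checkpoint before ST training (the checkpoint layout and the "encoder." parameter prefix are assumptions, not ESPnet's actual naming):

```python
import torch

def init_st_encoder_from_asr(st_model, asr_checkpoint_path):
    """Copy ASR encoder weights into an ST model before ST fine-tuning."""
    asr_state = torch.load(asr_checkpoint_path, map_location="cpu")
    # Keep only encoder parameters; the "encoder." prefix is an assumed naming scheme.
    enc_state = {k[len("encoder."):]: v for k, v in asr_state.items()
                 if k.startswith("encoder.")}
    st_model.encoder.load_state_dict(enc_state)
    return st_model
```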
7. Text-to-Speech (TTS)
- Compared on M-AILABS (Italian) and LJSpeech (English)
- Validation loss (L1) is comparable between models
- Transformer learns better with large batches
- Guided attention loss selectively applied to a few attention heads
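The guided attention loss just mentioned penalizes attention mass far from the diagonal, exploiting the roughly monotonic text-to-speech alignment; a minimal sketch for a single attention matrix (sigma and the per-head selection are illustrative, not the paper's exact settings):

```python
import torch

def guided_attention_loss(att, sigma=0.4):
    """Guided attention loss for one (T_out, T_in) attention-weight matrix.

    Weights far from the diagonal are penalized; sigma controls how wide the
    allowed diagonal band is. In the paper this loss is applied only to a few
    heads/layers rather than all of them.
    """
    t_out, t_in = att.shape
    pos_out = torch.arange(t_out, dtype=torch.float32)[:, None] / t_out
    pos_in = torch.arange(t_in, dtype=torch.float32)[None, :] / t_in
    weight = 1.0 - torch.exp(-((pos_in - pos_out) ** 2) / (2.0 * sigma ** 2))
    return (att * weight).mean()
```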
⚠️ Transformer TTS Challenges
- Decoding is significantly slower than with RNN
- FastSpeech reduces per-frame latency (0.6 ms/frame vs. 78 ms/frame for the autoregressive Transformer)
8. Conclusion
Transformer brings:
✅ Better accuracy ✅ Easier scaling with multiple GPUs ✅ Reproducible and open recipes (via ESPnet)
But...
⚠️ Needs careful training ⚠️ Inference can be slower, especially in TTS
Nonetheless, this work strongly positions Transformer as a preferred architecture for end-to-end speech tasks.
9. Resources
- GitHub: ESPnet Toolkit
- Audio samples & TTS demos: bit.ly/329gif5
- [1] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, "A pitch extraction algorithm tuned for automatic speech recognition," in ICASSP, 2014, pp. 2494–2498.
📌 Citation
If you find this useful, please cite:
@INPROCEEDINGS{9003750,
author={Karita, Shigeki and Chen, Nanxin and Hayashi, Tomoki and Hori, Takaaki and Inaguma, Hirofumi and Jiang, Ziyan and Someki, Masao and Soplin, Nelson Enrique Yalta and Yamamoto, Ryuichi and Wang, Xiaofei and Watanabe, Shinji and Yoshimura, Takenori and Zhang, Wangyou},
booktitle={2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
title={A Comparative Study on Transformer vs RNN in Speech Applications},
year={2019},
pages={449-456},
keywords={Decoding;Training;Task analysis;Xenon;Recurrent neural networks;Speech recognition;Transforms;Transformer;Recurrent Neural Networks;Speech Recognition;Text-to-Speech;Speech Translation},
doi={10.1109/ASRU46091.2019.9003750}
}