End-to-End Speech Recognition in Agglutinative Languages

Mamyrbayev, Orken; Alimhan, Keylan; Zhumazhanov, Bagashar; Turdalykyzy, Tolganay; Gusmanova, Farida

doi:10.1007/978-3-030-42058-1_33

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12034)

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems


Abstract

This paper considers end-to-end speech recognition systems based on deep neural networks (DNN). The experiments used different types of neural networks, a CTC model, and attention-based encoder-decoder models. The study showed that the CTC model can be applied directly to agglutinative languages without language models, but the best result was obtained with ResNet, at a CER of 11.52% and a WER of 19.57% when using the language model. An experiment with a BLSTM neural network using the attention-based encoder-decoder model gave a CER of 8.01% and a WER of 17.91%. The experiments showed that good results can be achieved without integrating language models. ResNet showed the best result.
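
The CER and WER figures quoted above are edit-distance metrics. The following is a minimal sketch of how such scores are commonly computed, not the authors' evaluation code: the function names `edit_distance`, `cer`, and `wer` are hypothetical, and the sketch assumes whitespace tokenization for words and counts spaces as characters for CER.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from reference prefix to empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

With the placeholder strings `"abc def"` (reference) and `"abc deg"` (hypothesis), one wrong character gives a CER of 1/7 but a WER of 1/2, because the whole word counts as an error. This is one reason WER tends to sit well above CER for agglutinative languages, where a single wrong suffix invalidates a long word.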



Acknowledgments

This work was supported by the Ministry of Education and Science of the Republic of Kazakhstan (IRN AP05131207, "Development of technologies for multilingual automatic speech recognition using deep neural networks").

Author information

Authors and Affiliations

Institute of Information and Computational Technologies, Almaty, 050010, Kazakhstan
Orken Mamyrbayev, Bagashar Zhumazhanov & Tolganay Turdalykyzy
Tokyo Denki University, Tokyo, 120-8551, Japan
Keylan Alimhan
Al-Farabi Kazakh National University, Almaty, 050040, Kazakhstan
Orken Mamyrbayev & Farida Gusmanova


Corresponding author

Correspondence to Orken Mamyrbayev.

Editor information

Editors and Affiliations

Department of Applied Informatics, Wrocław University of Science and Technology, Wrocław, Poland
Ngoc Thanh Nguyen
King Mongkut’s Institute of Technology Ladkrabang, Bangkok, Thailand
Kietikul Jearanaitanakij
Faculty of Computer Science and Information, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
Ali Selamat
Department of Applied Informatics, Wrocław University of Science and Technology, Wrocław, Poland
Bogdan Trawiński
King Mongkut’s Institute of Technology Ladkrabang, Bangkok, Thailand
Suphamit Chittayasothorn


Cite this paper

Mamyrbayev, O., Alimhan, K., Zhumazhanov, B., Turdalykyzy, T., Gusmanova, F. (2020). End-to-End Speech Recognition in Agglutinative Languages. In: Nguyen, N., Jearanaitanakij, K., Selamat, A., Trawiński, B., Chittayasothorn, S. (eds) Intelligent Information and Database Systems. ACIIDS 2020. Lecture Notes in Computer Science, vol 12034. Springer, Cham. https://doi.org/10.1007/978-3-030-42058-1_33


DOI: https://doi.org/10.1007/978-3-030-42058-1_33
Published: 04 March 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-42057-4
Online ISBN: 978-3-030-42058-1
eBook Packages: Computer Science, Computer Science (R0)
