Speech Recognition with Multi-modal Features Based on Neural Networks

Kim, Myung Won; Ryu, Joung Woo; Kim, Eun Ju

doi:10.1007/11893257_55

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4233)

Included in the following conference series: International Conference on Neural Information Processing

Abstract

Recent research has focused on the fusion of audio and visual features for reliable speech recognition in noisy environments. In this paper, we propose a neural-network-based model for robust speech recognition that integrates audio, visual, and contextual information. The Bimodal Neural Network (BMNN) is a four-layer multi-layer perceptron that combines audio and visual features of speech to compensate for the loss of audio information caused by noise. To further improve recognition accuracy in noisy environments, we also propose a post-processing step based on contextual information, namely the sequential patterns of words spoken by a user. Our experimental results show that the model outperforms every single-modality model. In particular, when the contextual information is used, the model achieves over 90% recognition accuracy even in noisy environments, which is a significant improvement over the state of the art in speech recognition.
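The two ideas in the abstract, fusing audio and visual features in a four-layer perceptron and then post-processing the output with word-sequence context, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say where fusion happens or how context is applied, so the following are all assumptions made for illustration: fusion by concatenating the two feature vectors at the input, sigmoid hidden units, a softmax output, random untrained weights, and a bigram table used to reweight the network outputs. The feature sizes, the vocabulary, and every function name are hypothetical.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer (untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer, activation):
    """Apply one fully connected layer followed by an element-wise activation."""
    weights, biases = layer
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)                                   # subtract max for stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def bimodal_forward(audio, visual, layers):
    """Assumed early fusion: concatenate both feature vectors, then run the MLP."""
    h = list(audio) + list(visual)                # layer 1: fused input
    for layer in layers[:-1]:
        h = dense(h, layer, sigmoid)              # layers 2 and 3: hidden
    return softmax(dense(h, layers[-1], lambda z: z))  # layer 4: word scores

def rescore_with_context(posteriors, prev_word, bigram, vocab, floor=1e-3):
    """Assumed post-processing: weight each score by P(word | previous word)."""
    scores = [p * bigram.get((prev_word, w), floor)
              for p, w in zip(posteriors, vocab)]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical toy setup: 3 audio features, 2 visual features, 3-word vocabulary.
rng = random.Random(0)
vocab = ["open", "close", "door"]
layers = [make_layer(5, 4, rng), make_layer(4, 4, rng), make_layer(4, len(vocab), rng)]
posteriors = bimodal_forward([0.2, -0.1, 0.4], [0.7, 0.3], layers)

# Hypothetical word-sequence statistics for one user: "door" often follows "open".
bigram = {("open", "door"): 0.8, ("open", "close"): 0.1, ("open", "open"): 0.1}
rescored = rescore_with_context(posteriors, "open", bigram, vocab)
print(dict(zip(vocab, rescored)))
```

In this sketch the context step can only shift probability toward words that are likely to follow the previous word, which is one plausible way such post-processing could recover from a noisy acoustic decision. A trained network and context statistics estimated from real usage would replace the random weights and the hand-written bigram table.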

Author information

Authors and Affiliations

School of Computing, Soongsil University, 511, Sangdo-Dong, Dongjak-Gu, Seoul, Korea
Myung Won Kim & Eun Ju Kim
Intelligent Robot Research Division, Electronics and Telecommunications Research Institute, 161 Gajeong-dong, Yuseong-gu, Daejeon, Korea
Joung Woo Ryu

Editor information

Editors and Affiliations

Dept. of Computer Science and Engineering, The Chinese Univ. of Hong Kong, Shatin, N.T., Hong Kong
Irwin King
Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong, China
Jun Wang
The Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Lai-Wan Chan
Department of Computer Science and Engineering & Center for Cognitive Science, The Ohio State University, Columbus, OH 43210
DeLiang Wang

About this paper

Cite this paper

Kim, M.W., Ryu, J.W., Kim, E.J. (2006). Speech Recognition with Multi-modal Features Based on Neural Networks. In: King, I., Wang, J., Chan, LW., Wang, D. (eds) Neural Information Processing. ICONIP 2006. Lecture Notes in Computer Science, vol 4233. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11893257_55

DOI: https://doi.org/10.1007/11893257_55
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46481-5
Online ISBN: 978-3-540-46482-2
eBook Packages: Computer Science, Computer Science (R0)
