Speech Recognition with Multi-modal Features Based on Neural Networks

  • Conference paper
Neural Information Processing (ICONIP 2006)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4233)

Abstract

Recent research has focused on fusing audio and visual features for reliable speech recognition in noisy environments. In this paper, we propose a neural-network-based model for robust speech recognition that integrates audio, visual, and contextual information. The Bimodal Neural Network (BMNN) is a four-layer multi-layer perceptron that combines audio and visual features of speech to compensate for the loss of audio information caused by noise. To further improve recognition accuracy in noisy environments, we also propose a post-processing step based on contextual information, namely the sequential patterns of words spoken by a user. Our experimental results show that the model outperforms any single-mode model. In particular, when contextual information is used, we obtain over 90% recognition accuracy even in noisy environments, which is a significant improvement over the state of the art in speech recognition.
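The two ideas in the abstract, bimodal fusion in a four-layer perceptron and post-processing with word-sequence context, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the features are fused, how the layers are counted, or how context is applied. The sketch assumes early fusion (concatenating the audio and visual vectors at the input), counts the four layers as input, two hidden, and output, and assumes the context takes the form of a bigram table over words. All function names, layer sizes, the `weight` parameter, and the toy numbers are hypothetical.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(x, layer):
    """Compute W x + b for one layer."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bmnn_forward(audio, visual, layers):
    """Assumed early fusion: concatenate both feature vectors, then run the MLP.

    `layers` holds three weight layers, giving four layers of units
    (input, two hidden, output) under the counting assumed here.
    """
    h = list(audio) + list(visual)
    for layer in layers[:-1]:
        h = [sigmoid(z) for z in affine(h, layer)]
    return softmax(affine(h, layers[-1]))


def rescore_with_context(word_probs, prev_word, bigram, weight=1.0):
    """Assumed post-processing: scale each word's network probability by
    P(word | prev_word) raised to `weight`, then renormalise."""
    scores = [p * (bigram[prev_word][w] ** weight) for w, p in enumerate(word_probs)]
    total = sum(scores)
    return [s / total for s in scores]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical sizes: 4 audio + 2 visual features, 3 vocabulary words.
    layers = [make_layer(6, 5, rng), make_layer(5, 4, rng), make_layer(4, 3, rng)]
    probs = bmnn_forward([0.1, 0.2, 0.3, 0.4], [0.5, 0.6], layers)
    # Toy bigram table: row = previous word, column = candidate word.
    bigram = [[0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.2, 0.2]]
    print(probs)
    print(rescore_with_context(probs, prev_word=0, bigram=bigram))
```

The point of the rescoring step is that a word the network finds only marginally less likely than its top choice can still win if it is a much more probable successor to the previously recognised word. In this sketch the `weight` exponent controls how strongly context overrides the acoustic and visual evidence.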

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kim, M.W., Ryu, J.W., Kim, E.J. (2006). Speech Recognition with Multi-modal Features Based on Neural Networks. In: King, I., Wang, J., Chan, LW., Wang, D. (eds) Neural Information Processing. ICONIP 2006. Lecture Notes in Computer Science, vol 4233. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11893257_55

  • DOI: https://doi.org/10.1007/11893257_55

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-46481-5

  • Online ISBN: 978-3-540-46482-2

  • eBook Packages: Computer Science, Computer Science (R0)
