
Speech Recognition Technology in Multimodal/Ubiquitous Computing Environments

Spoken Multimodal Human-Computer Dialogue in Mobile Environments

Part of the book series: Text, Speech and Language Technology (TSLT, volume 28)


Abstract

In the ubiquitous (pervasive) computing era, everybody is expected to access information services anytime, anywhere, and these services are expected to augment a wide range of intelligent human activities. Speech recognition technology can play an important role in this era by providing: (a) conversational systems for accessing information services, and (b) systems for transcribing, understanding and summarising ubiquitous speech documents such as meetings, lectures, presentations and voicemails. For the former, robust conversation using wireless handheld or hands-free devices in real mobile computing environments will be crucial, as will multimodal speech recognition technology. For the latter, the ability to understand and summarise speech documents is a key requirement. This chapter presents technological perspectives and introduces several research activities being conducted from these standpoints.
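One common way multimodal (audio-visual) speech recognition improves robustness is stream-weighted score fusion: the audio and visual models each score the candidate hypotheses, and the scores are combined with a weight reflecting how much each stream is trusted. The sketch below is only an illustration of that general idea, not the specific method described in this chapter; the hypothesis names and log-likelihood values are invented for the example.

```python
def fuse_scores(audio_logps, visual_logps, lam=0.7):
    """Combine per-hypothesis audio and visual log-likelihoods with a
    stream weight lam (0 <= lam <= 1); higher lam trusts audio more.
    Returns the hypothesis with the best combined score."""
    combined = {
        word: lam * a + (1.0 - lam) * visual_logps[word]
        for word, a in audio_logps.items()
    }
    return max(combined, key=combined.get)

# Hypothetical scores for two candidate words: in a noisy room the
# audio model slightly prefers "top", but lip movement favours "stop".
audio = {"stop": -4.0, "top": -3.8}
visual = {"stop": -1.0, "top": -5.0}

print(fuse_scores(audio, visual, lam=1.0))  # audio only -> "top"
print(fuse_scores(audio, visual, lam=0.7))  # visual stream helps -> "stop"
```

With the visual stream weighted in, the acoustically confusable hypothesis is rejected, which is the intuition behind using lip information under acoustic noise.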




Copyright information

© 2005 Springer

About this chapter

Cite this chapter

Furui, S. (2005). Speech Recognition Technology in Multimodal/Ubiquitous Computing Environments. In: Minker, W., Bühler, D., Dybkjær, L. (eds) Spoken Multimodal Human-Computer Dialogue in Mobile Environments. Text, Speech and Language Technology, vol 28. Springer, Dordrecht. https://doi.org/10.1007/1-4020-3075-4_2


  • DOI: https://doi.org/10.1007/1-4020-3075-4_2


  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-1-4020-3073-4

  • Online ISBN: 978-1-4020-3075-8

  • eBook Packages: Computer Science, Computer Science (R0)
