Abstract
In the ubiquitous (pervasive) computing era, it is expected that everyone will be able to access information services anytime, anywhere, and that these services will augment a wide range of human intellectual activities. Speech recognition technology can play an important role in this era by providing: (a) conversational systems for accessing information services, and (b) systems for transcribing, understanding and summarising ubiquitous speech documents such as meetings, lectures, presentations and voicemail. For the former, robust conversation using wireless handheld/hands-free devices in real mobile computing environments will be crucial, as will multimodal speech recognition technology. For the latter, the ability to understand and summarise speech documents is a key requirement. This chapter presents technological perspectives on these issues and introduces several research activities being conducted from these standpoints.
Copyright information
© 2005 Springer
About this chapter
Cite this chapter
Furui, S. (2005). Speech Recognition Technology in Multimodal/Ubiquitous Computing Environments. In: Minker, W., Bühler, D., Dybkjær, L. (eds) Spoken Multimodal Human-Computer Dialogue in Mobile Environments. Text, Speech and Language Technology, vol 28. Springer, Dordrecht. https://doi.org/10.1007/1-4020-3075-4_2
Print ISBN: 978-1-4020-3073-4
Online ISBN: 978-1-4020-3075-8