Abstract
In the ubiquitous (pervasive) computing era, it is expected that everyone will be able to access information services anytime, anywhere, and that these services will augment a wide range of human intellectual activities. Speech recognition technology can play an important role in this era by providing: (a) conversational systems for accessing information services, and (b) systems for transcribing, understanding and summarising ubiquitous speech documents such as meetings, lectures, presentations and voicemail. For the former, robust conversation using wireless handheld/hands-free devices in real mobile computing environments will be crucial, as will multimodal speech recognition technology. For the latter, the ability to understand and summarise speech documents is a key requirement. This chapter presents technological perspectives on these issues and introduces several research activities being conducted from these standpoints.
Copyright information
© 2005 Springer
About this chapter
Cite this chapter
Furui, S. (2005). Speech Recognition Technology in Multimodal/Ubiquitous Computing Environments. In: Minker, W., Bühler, D., Dybkjær, L. (eds) Spoken Multimodal Human-Computer Dialogue in Mobile Environments. Text, Speech and Language Technology, vol 28. Springer, Dordrecht. https://doi.org/10.1007/1-4020-3075-4_2
Print ISBN: 978-1-4020-3073-4
Online ISBN: 978-1-4020-3075-8