Abstract

Automatic speech recognition (ASR) is a critical component of CHIL services. For example, it provides the input to higher-level technologies, such as summarization and question answering, as discussed in Chapter 8. In the spirit of ubiquitous computing, the goal of ASR in CHIL is to achieve high performance using far-field sensors (networks of microphone arrays and distributed far-field microphones). Close-talking microphones nevertheless remain of interest, as they provide a best-case acoustic channel against which far-field ASR development can be benchmarked.
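
The far-field goal above typically rests on acoustic beamforming: combining the channels of a microphone array so that speech arriving from the speaker's direction is reinforced while noise and reverberation are attenuated. As a minimal sketch of the idea, not the chapter's actual front end, the following Python fragment implements delay-and-sum beamforming under an assumed known speaker position; the function name, array geometry, and parameter values are all illustrative.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate at room temperature


def delay_and_sum(signals, mic_positions, source_position, sample_rate):
    """Delay-and-sum beamformer (illustrative sketch, not the chapter's system).

    signals:         (num_mics, num_samples) array of simultaneous recordings
    mic_positions:   (num_mics, 3) microphone coordinates in meters
    source_position: (3,) assumed speaker position in meters
    sample_rate:     sampling rate in Hz
    """
    num_mics, num_samples = signals.shape

    # Direct-path propagation delay from the source to each microphone.
    distances = np.linalg.norm(mic_positions - source_position, axis=1)
    delays = distances / SPEED_OF_SOUND

    # Advance each channel by its delay so the direct-path arrivals align;
    # the fractional delay is applied as a phase shift in the frequency domain.
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / sample_rate)
    spectra = np.fft.rfft(signals, axis=1)
    aligned = np.fft.irfft(
        spectra * np.exp(2j * np.pi * freqs * delays[:, None]),
        n=num_samples, axis=1)

    # Average the aligned channels: coherent speech adds constructively,
    # while uncorrelated noise and reverberation partially cancel.
    return aligned.mean(axis=0)


if __name__ == "__main__":
    # Toy usage: a 4-microphone linear array with 10 cm spacing, 16 kHz audio.
    fs = 16000
    mics = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
                     [0.2, 0.0, 0.0], [0.3, 0.0, 0.0]])
    speaker = np.array([2.0, 1.5, 0.0])      # assumed speaker location
    x = np.random.randn(4, fs)               # placeholder for real audio
    y = delay_and_sum(x, mics, speaker, fs)
    print(y.shape)                            # (16000,)
```

In a real smart room, the speaker position would come from a person-tracking component, and more sophisticated adaptive beamformers are used in practice; the sketch only conveys why aligning and averaging array channels improves the far-field signal before recognition.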

Copyright information

© 2009 Springer-Verlag London Limited

About this chapter

Cite this chapter

Potamianos, G. et al. (2009). Automatic Speech Recognition. In: Waibel, A., Stiefelhagen, R. (eds) Computers in the Human Interaction Loop. Human–Computer Interaction Series. Springer, London. https://doi.org/10.1007/978-1-84882-054-8_6

  • DOI: https://doi.org/10.1007/978-1-84882-054-8_6

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84882-053-1

  • Online ISBN: 978-1-84882-054-8

  • eBook Packages: Computer Science (R0)
