Abstract
Multi-modal interfaces for mobile applications include tiny screens, keyboards, touch screens, earphones, microphones, and software components for voice-based man-machine interaction. In a noisy environment, the voice recognition software and the microphone are of primary importance. Voice applications currently perform reasonably well in quiet environments. In many practical situations, however, surrounding noise severely degrades the quality of the speech signal, and the recognition rate drops significantly as a consequence. Noise management is therefore a major focus in developing voice-enabled technologies. This project addresses the problem of voice recognition with the goal of reaching a high success rate (ideally above 99%) in a noisy and hostile outdoor environment: the user stands on the open deck of a motorboat and uses his or her voice, through a wireless microphone, to command applications running on a laptop. In addition to the problem of noise, other constraints strongly limit the hardware options. Furthermore, the user must perform several tasks simultaneously. The success of the solution must therefore rely on the efficiency and effectiveness of the voice recognition algorithm and on the choice of microphone. In addition, the training of the recognizer should be kept to a minimum, and recognition should take no longer than 3 seconds. For these two reasons, only a limited set of voice commands has been tested.
A first demonstrator, based on digit keyword spotting trained on telephone speech, performed poorly in very noisy conditions. A second demonstrator, combining neural network and template matching techniques, led to nearly acceptable results when the user recorded the keywords. Since the recognition rate was estimated at around 90%, short of the target, no additional field test was undertaken. This R&D project shows that state-of-the-art voice recognition needs further investigation before spoken keywords can be recognized reliably in noisy environments. In addition to ongoing improvements, unconventional research approaches worth testing include deriving keywords adapted to specialized algorithms and having the user learn these keywords.
© 2009 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Stricker, C. et al. (2009). Intelligent Multi-modal Interfaces for Mobile Applications in Hostile Environment (IM-HOST). In: Lalanne, D., Kohlas, J. (eds) Human Machine Interaction. Lecture Notes in Computer Science, vol 5440. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00437-7_4
DOI: https://doi.org/10.1007/978-3-642-00437-7_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00436-0
Online ISBN: 978-3-642-00437-7
eBook Packages: Computer Science (R0)