Abstract
The Viterbi dynamic programming algorithm is currently the de facto standard by which speech recognizers handle duration variations of the sub-word units of speech: it aligns the sub-word units with the sub-word unit models. The algorithm is an integral part of hidden Markov model speech recognizers. In this work, a robust and simple voice command system is developed, implemented, and tested. Instead of the Viterbi algorithm, it uses a novel speech alignment algorithm, the so-called “run-length limited dynamic programming algorithm” (RLL-DP). The voice command system facilitates the operation of the AUTONOMY system, an environmental control system combined with an alternative and augmentative communication system, using isolated words as voice commands. Activating the “run-length limits” yields a statistically significant reduction of the word error rate, even when simple “centroid sequence word models” are used instead of the acoustic models based on “hidden control neural networks” used in previous versions.
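The abstract does not specify how RLL-DP works, so the following is only an illustrative sketch of the general idea of a dynamic programming alignment with limits on run lengths, not the authors' algorithm. It assumes that a word model is a left-to-right sequence of centroids, that each centroid must cover a run of consecutive frames whose length lies between a minimum and a maximum, and that the local cost is the squared Euclidean distance. The names `rll_align`, `min_run`, and `max_run` are hypothetical.

```python
import math


def rll_align(frames, centroids, min_run=1, max_run=None):
    """Align frames to a left-to-right centroid sequence (illustrative only).

    Assumption: "run-length limits" means every centroid covers between
    min_run and max_run consecutive frames. Returns (total_cost, runs),
    where runs[j] is the number of frames assigned to centroid j, or
    (inf, None) if no alignment satisfies the limits.
    """
    T, J = len(frames), len(centroids)
    if max_run is None:
        max_run = T  # no upper limit: unconstrained DP alignment
    inf = math.inf

    def dist(x, c):
        # Squared Euclidean distance between a frame and a centroid.
        return sum((a - b) ** 2 for a, b in zip(x, c))

    # prefix[j][t] = cost of assigning frames[0:t] to centroid j
    prefix = []
    for c in centroids:
        acc = [0.0]
        for x in frames:
            acc.append(acc[-1] + dist(x, c))
        prefix.append(acc)

    # best[j][t] = minimal cost of covering frames[0:t] with centroids[0:j]
    best = [[inf] * (T + 1) for _ in range(J + 1)]
    back = [[0] * (T + 1) for _ in range(J + 1)]
    best[0][0] = 0.0
    for j in range(1, J + 1):
        for t in range(1, T + 1):
            # Try every permitted run length d for centroid j-1 ending at t.
            for d in range(min_run, min(max_run, t) + 1):
                prev = best[j - 1][t - d]
                if prev == inf:
                    continue
                cost = prev + prefix[j - 1][t] - prefix[j - 1][t - d]
                if cost < best[j][t]:
                    best[j][t] = cost
                    back[j][t] = d

    if best[J][T] == inf:
        return inf, None

    # Trace back the chosen run lengths from the last centroid to the first.
    runs, t = [], T
    for j in range(J, 0, -1):
        runs.append(back[j][t])
        t -= back[j][t]
    runs.reverse()
    return best[J][T], runs


frames = [[0.0], [0.0], [0.0], [1.0]]
centroids = [[0.0], [1.0]]
print(rll_align(frames, centroids))             # (0.0, [3, 1])
print(rll_align(frames, centroids, max_run=2))  # (1.0, [2, 2])
```

In this toy example the unconstrained alignment lets the first centroid absorb three frames, whereas an upper limit of two frames per centroid forces a more balanced segmentation at a higher cost. The sketch only illustrates the kind of restriction that run-length limits impose on an alignment path.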
Cite this article
Hickersberger, H., Zagler, W.L. A voice command system for AUTONOMY using a novel speech alignment algorithm. Int J Speech Technol 16, 461–469 (2013). https://doi.org/10.1007/s10772-013-9196-2
DOI: https://doi.org/10.1007/s10772-013-9196-2