Abstract
We propose a voice activity detector (VAD) that uses long-term (200 ms) Mel frequency band statistics, auditory masking, and a pre-trained classifier based on a two-level decision tree ensemble. This combination captures the syllable-level structure of speech and discriminates it from common noises. On the test dataset, the proposed algorithm achieves almost 100 % acceptance of clear voice for English, Chinese, Russian, and Polish speech and 100 % rejection of stationary noises, independently of loudness. The algorithm is intended as a trigger for automatic speech recognition (ASR). It reuses the short-term FFT analysis (STFFT) of the ASR frontend, at an additional cost of 2 KB of memory and a 15 % complexity overhead.
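As an illustration of what "long-term Mel frequency band statistics" could look like in practice, the following minimal sketch computes a per-band mean and variance over a sliding 200 ms window of Mel band log energies. It is not the authors' implementation: the function name `long_term_band_stats`, the 10 ms frame hop (so 20 frames per window), and the choice of mean and variance as the statistics are all assumptions made for this example, since the abstract does not specify them.

```python
from collections import deque


def long_term_band_stats(frames, window=20):
    """Yield (means, variances) per Mel band over a sliding window.

    frames: iterable of equal-length lists of per-band log energies,
            e.g. taken from an existing STFT-based ASR frontend.
    window: frames per long-term window; 20 frames at an ASSUMED
            10 ms hop gives the 200 ms span named in the abstract.
    """
    history = deque(maxlen=window)  # keeps only the last `window` frames
    for frame in frames:
        history.append(list(frame))
        if len(history) < window:
            continue  # not enough long-term context yet
        n_bands = len(frame)
        # ASSUMPTION: mean and (population) variance are used as the
        # band statistics; the paper may use different ones.
        means = [sum(f[b] for f in history) / window for b in range(n_bands)]
        variances = [
            sum((f[b] - means[b]) ** 2 for f in history) / window
            for b in range(n_bands)
        ]
        yield means, variances
```

Each yielded pair could then serve as a feature vector for a pre-trained classifier. Keeping only the last 20 frames per band is in line with the small memory overhead the abstract reports, although the actual memory layout of the proposed algorithm is not described here.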
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Salishev, S., Barabanov, A., Kocharov, D., Skrelin, P., Moiseev, M. (2016). Voice Activity Detector (VAD) Based on Long-Term Mel Frequency Band Features. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2016. Lecture Notes in Computer Science, vol 9924. Springer, Cham. https://doi.org/10.1007/978-3-319-45510-5_40
DOI: https://doi.org/10.1007/978-3-319-45510-5_40
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45509-9
Online ISBN: 978-3-319-45510-5
eBook Packages: Computer Science, Computer Science (R0)