Voice Activity Detector (VAD) Based on Long-Term Mel Frequency Band Features

Salishev, Sergey; Barabanov, Andrey; Kocharov, Daniil; Skrelin, Pavel; Moiseev, Mikhail

doi:10.1007/978-3-319-45510-5_40

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9924)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

We propose a voice activity detector (VAD) that uses long-term (200 ms) Mel frequency band statistics, auditory masking, and a pre-trained classifier based on a two-level decision tree ensemble, which together capture the syllable-level structure of speech and discriminate it from common noises. On the test dataset, the proposed algorithm demonstrates almost 100% acceptance of clear voice for English, Chinese, Russian, and Polish speech and 100% rejection of stationary noises, independently of loudness. The algorithm is intended as a trigger for automatic speech recognition (ASR). It reuses the short-term FFT analysis (STFFT) of the ASR frontend, at the cost of an additional 2 KB of memory and a 15% complexity overhead.
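
To make the idea of long-term band statistics concrete, the sketch below computes per-band statistics over a trailing 200 ms window of frames. It is only an illustration under stated assumptions: the abstract does not say which statistics are used, what the frame hop is, or how many Mel bands there are. The 10 ms hop, the choice of mean/min/max, and the name `long_term_band_stats` are hypothetical and are not taken from the paper. The input is assumed to be per-band log energies already produced by a short-term FFT frontend.

```python
from collections import deque
from statistics import fmean


def long_term_band_stats(frames, window_ms=200, hop_ms=10):
    """Return, for each frame, per-band (mean, min, max) over a trailing window.

    frames: iterable of equal-length sequences of per-band log energies
            (assumed to come from an existing short-term FFT frontend).
    window_ms, hop_ms: assumed values; 200 ms / 10 ms gives 20 frames.
    """
    n_frames = max(1, window_ms // hop_ms)
    history = deque(maxlen=n_frames)  # oldest frame is dropped automatically
    stats = []
    for frame in frames:
        history.append(list(frame))
        # Transpose the window so that each tuple holds one band over time.
        per_band = zip(*history)
        stats.append([(fmean(b), min(b), max(b)) for b in per_band])
    return stats


if __name__ == "__main__":
    # Three frames, two bands, with a 20 ms window (two frames) for brevity.
    demo = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
    for row in long_term_band_stats(demo, window_ms=20, hop_ms=10):
        print(row)
```

In a full detector, feature vectors of this kind would be passed to the pre-trained classifier; that stage is not sketched here because the abstract gives no details of the trees or their thresholds.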

Author information

Authors and Affiliations

Saint-Petersburg State University, Saint-Petersburg, Russia
Sergey Salishev, Andrey Barabanov, Daniil Kocharov & Pavel Skrelin
Intel Labs, Intel Corporation, Santa Clara, CA, 95054-1549, USA
Mikhail Moiseev

Corresponding author

Correspondence to Mikhail Moiseev.

Editor information

Editors and Affiliations

Masaryk University, Brno, Czech Republic
Petr Sojka, Aleš Horák, Ivan Kopeček & Karel Pala

About this paper

Cite this paper

Salishev, S., Barabanov, A., Kocharov, D., Skrelin, P., Moiseev, M. (2016). Voice Activity Detector (VAD) Based on Long-Term Mel Frequency Band Features. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2016. Lecture Notes in Computer Science, vol 9924. Springer, Cham. https://doi.org/10.1007/978-3-319-45510-5_40

DOI: https://doi.org/10.1007/978-3-319-45510-5_40
Published: 03 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45509-9
Online ISBN: 978-3-319-45510-5
eBook Packages: Computer Science, Computer Science (R0)
