Voice Activity Detector (VAD) Based on Long-Term Mel Frequency Band Features

  • Conference paper
Text, Speech, and Dialogue (TSD 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9924)

Abstract

We propose a voice activity detector (VAD) that uses long-term (200 ms) Mel frequency band statistics, auditory masking, and a pre-trained two-level classifier based on a decision tree ensemble. This combination captures the syllable-level structure of speech and discriminates it from common noises. On the test dataset, the proposed algorithm achieves almost 100% acceptance of clear voice for English, Chinese, Russian, and Polish speech and 100% rejection of stationary noises, independently of loudness. The algorithm is intended as a trigger for automatic speech recognition (ASR). It reuses the short-term FFT analysis (STFFT) of the ASR front end, adding 2 KB of memory and a 15% complexity overhead.
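
To make the idea of long-term Mel band statistics concrete, here is a minimal sketch of how per-frame FFT power spectra could be pooled into Mel bands and summarized over a trailing 200 ms window. The abstract does not specify the filterbank shape, the frame hop, or which statistics are used, so everything below is an assumption for illustration: the HTK-style Mel formula, rectangular bands, a 10 ms hop, the mean/variance/range features, and all function and class names are hypothetical. Auditory masking and the decision tree classifier are not modeled.

```python
import math
from collections import deque


def hz_to_mel(f_hz):
    """HTK-style Mel mapping (assumed; the paper may use a different scale)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(n_bands, f_min, f_max):
    """Return n_bands + 1 edge frequencies in Hz, equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    return [mel_to_hz(lo + (hi - lo) * i / n_bands) for i in range(n_bands + 1)]


def band_log_energies(power_spectrum, sample_rate, edges):
    """Sum FFT power bins into rectangular Mel bands and take the natural log.

    power_spectrum holds bins from 0 Hz to sample_rate / 2 inclusive.
    """
    bin_hz = (sample_rate / 2.0) / (len(power_spectrum) - 1)
    energies = [0.0] * (len(edges) - 1)
    for k, p in enumerate(power_spectrum):
        f = k * bin_hz
        for b in range(len(edges) - 1):
            if edges[b] <= f < edges[b + 1]:
                energies[b] += p
                break
    # A small floor avoids log(0) for empty or silent bands.
    return [math.log(e + 1e-10) for e in energies]


class LongTermBandStats:
    """Per-band statistics over a trailing window of frames.

    With the assumed 10 ms hop, a 200 ms window spans 20 frames.
    """

    def __init__(self, window_ms=200, hop_ms=10):
        self.n_frames = window_ms // hop_ms
        self.history = deque(maxlen=self.n_frames)

    def update(self, log_energies):
        """Add one frame and return (mean, variance, range) for each band."""
        self.history.append(list(log_energies))
        stats = []
        for b in range(len(log_energies)):
            column = [frame[b] for frame in self.history]
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, var, max(column) - min(column)))
        return stats
```

In a setup like this, each frame already produced by the ASR front end would be passed through `band_log_energies` and then `LongTermBandStats.update`, and the resulting per-band tuples would form the feature vector for a classifier. Stationary noise would tend to give a small variance and range in every band, whereas syllable-rate modulation in speech would tend to give larger values.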

Author information

Corresponding author

Correspondence to Mikhail Moiseev.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Salishev, S., Barabanov, A., Kocharov, D., Skrelin, P., Moiseev, M. (2016). Voice Activity Detector (VAD) Based on Long-Term Mel Frequency Band Features. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2016. Lecture Notes in Computer Science, vol 9924. Springer, Cham. https://doi.org/10.1007/978-3-319-45510-5_40

  • DOI: https://doi.org/10.1007/978-3-319-45510-5_40

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45509-9

  • Online ISBN: 978-3-319-45510-5

  • eBook Packages: Computer Science, Computer Science (R0)
