
DNN and i-vector combined method for speaker recognition on multi-variability environments

Published in: International Journal of Speech Technology

Abstract

The article addresses the compensation of variability in automatic speaker verification systems in scenarios where variability conditions due to utterance duration, reverberation, and environmental noise are present simultaneously. We introduce a new representation of the speaker's discriminative information, based on a deep neural network trained discriminatively for speaker classification combined with the i-vector representation. The proposed representation increases verification performance, reducing the error by between 2.5% and 7.9% across all variability conditions compared to the baseline systems. We also analyze the robustness of the speaker verification system using the interquartile range, obtaining a 1.19-fold improvement over the evaluated baselines.
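The two ideas in the abstract can be illustrated with a minimal sketch. This is not the paper's exact method: the embedding and i-vector dimensions, the concatenation-based fusion, and the EER values are all hypothetical placeholders, and the interquartile range (IQR) is computed in its standard form as a dispersion measure over per-condition error rates.

```python
import numpy as np

def length_norm(v):
    """Project a vector onto the unit sphere (a common step before scoring)."""
    return v / np.linalg.norm(v)

def iqr(values):
    """Interquartile range: spread between the 75th and 25th percentiles."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

# Hypothetical dimensions for illustration only: a DNN speaker-classification
# embedding and an i-vector extracted from the same utterance.
rng = np.random.default_rng(0)
dnn_embedding = rng.standard_normal(512)
i_vector = rng.standard_normal(400)

# One simple way to combine the two views: normalize each and concatenate
# into a single speaker representation.
combined = np.concatenate([length_norm(dnn_embedding), length_norm(i_vector)])

# Robustness viewed as dispersion of error rates across variability
# conditions (duration, reverberation, noise): a smaller IQR means the
# system's performance varies less, i.e. it is more robust.
baseline_eers = [8.0, 12.5, 15.0, 10.2]   # hypothetical EERs (%)
proposed_eers = [7.5, 10.0, 11.8, 9.1]    # hypothetical EERs (%)
robustness_gain = iqr(baseline_eers) / iqr(proposed_eers)
```

A gain greater than 1 indicates the proposed system's error rates are less dispersed across conditions than the baseline's, which is the sense in which the abstract reports a 1.19-fold robustness improvement.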




Author information


Corresponding author

Correspondence to Flavio J. Reyes-Díaz.


About this article


Cite this article

Reyes-Díaz, F.J., Hernández-Sierra, G. & de Lara, J.R.C. DNN and i-vector combined method for speaker recognition on multi-variability environments. Int J Speech Technol 24, 409–418 (2021). https://doi.org/10.1007/s10772-021-09796-1

