Speaker-Aware Training of Speech Emotion Classifier with Speaker Recognition

  • Conference paper
  • First Online:
Speech and Computer (SPECOM 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12997)

Abstract

In this paper, we discuss the possibility of improving the accuracy of speech emotion recognition in multi-user systems. We assume that a small corpus of emotional speech data is available for each speaker of interest. We propose to train a speaker-independent emotion classifier on arbitrary audio features, including deep embeddings, and to fine-tune it on the utterances of each speaker. As a result, every user is associated with his or her own emotion recognition model. The appropriate fine-tuned classifier may be chosen using a speaker recognition algorithm, or even fixed if the identity of the user is known. It is experimentally shown that the proposed approach significantly improves the quality of a conventional speaker-independent emotion classifier.
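
As a rough illustration of the approach described in the abstract, the sketch below trains a shared speaker-independent classifier, fine-tunes a copy of it on each speaker's small corpus, and selects the personalized model at inference time via speaker recognition. This is a minimal sketch, not the authors' implementation: the scikit-learn MLP, the precomputed embedding features, and every name here (train_base_classifier, identify_speaker, etc.) are illustrative assumptions.

```python
# Illustrative sketch only; not the paper's code. Assumes precomputed audio
# embeddings X (e.g., from a deep audio encoder) and emotion labels y.
import copy
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_base_classifier(X_train, y_train):
    """Speaker-independent emotion classifier over arbitrary audio features."""
    clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300, random_state=0)
    clf.fit(X_train, y_train)
    return clf

def fine_tune_per_speaker(base_clf, speaker_data, epochs=20):
    """Fine-tune a copy of the base model on each speaker's small corpus.

    speaker_data maps speaker_id -> (X_spk, y_spk); returns a dict mapping
    speaker_id -> personalized classifier.
    """
    personalized = {}
    for spk_id, (X_spk, y_spk) in speaker_data.items():
        clf = copy.deepcopy(base_clf)   # start from the shared weights
        for _ in range(epochs):         # a few extra passes on this speaker only
            clf.partial_fit(X_spk, y_spk)
        personalized[spk_id] = clf
    return personalized

def predict_emotion(x, personalized, base_clf, identify_speaker):
    """Route an utterance to the matching fine-tuned model.

    identify_speaker is any speaker recognition callable returning a speaker
    id, or None for an unknown speaker (falls back to the base model).
    """
    spk_id = identify_speaker(x)
    clf = personalized.get(spk_id, base_clf)
    return clf.predict(np.atleast_2d(x))[0]
```

Unknown speakers fall back to the shared base model, mirroring the paper's framing of the per-user classifiers as refinements of a common speaker-independent baseline.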


Notes

  1. https://www.kaggle.com/kuntaldas599/emotional-speech-classification2d.

  2. https://www.kaggle.com/tracyporter/speech-emotion-recognition.


Acknowledgments

This work is supported by the Russian Science Foundation (RSF), grant no. 20-71-10010.

Author information

Corresponding author

Correspondence to Andrey V. Savchenko.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Savchenko, L.V., Savchenko, A.V. (2021). Speaker-Aware Training of Speech Emotion Classifier with Speaker Recognition. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2021. Lecture Notes in Computer Science, vol. 12997. Springer, Cham. https://doi.org/10.1007/978-3-030-87802-3_55

  • DOI: https://doi.org/10.1007/978-3-030-87802-3_55

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87801-6

  • Online ISBN: 978-3-030-87802-3

  • eBook Packages: Computer Science, Computer Science (R0)
