Machine Learning Approaches for Speech Emotion Recognition: Classic and Novel Advances

Heracleous, Panikos; Ishikawa, Akio; Yasuda, Keiji; Kawashima, Hiroyuki; Sugaya, Fumiaki; Hashimoto, Masayuki

doi:10.1007/978-3-319-77116-8_14

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10762)

Included in the following conference series:

International Conference on Computational Linguistics and Intelligent Text Processing

Abstract

Speech is the most natural form of human communication and, among other things, conveys information about the speaker’s emotional state. The current study focuses on automatic speech emotion recognition based on classic and novel machine learning approaches using simulated emotional speech data. Specifically, three approaches are used: individual Gaussian mixture models (GMMs) trained for each emotion; a universal background model (UBM-GMM) adapted to each emotion using maximum a posteriori (MAP) adaptation; and an approach based on the i-vector paradigm, which is widely used in speaker recognition and language identification and is adapted here to emotion recognition. When using individual GMMs, a novel technique based on multiple classifiers and late fusion is also applied; in this case, a 90.9% recognition rate is obtained. When the state-of-the-art i-vector method is used together with a probabilistic linear discriminant analysis (PLDA) model, a 91.4% average rate for speaker-independent Japanese speech emotion recognition is achieved, which is a very promising result and superior to those reported in similar studies. In addition to Japanese emotion recognition, pair-wise recognition for seven emotions in German has also been conducted. The recognition rates obtained using the German database show the same tendency as in Japanese. In this experiment, an 89.2% average rate has been achieved.
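
The first approach in the abstract (one GMM per emotion, with several classifiers combined by late fusion) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which features, model sizes, or fusion rule were used, so the two feature streams, the toy one-dimensional parameters, the weighted-sum fusion, and the names `gmm_log_likelihood`, `late_fusion`, and `classify` are all assumptions made for illustration.

```python
import math

def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        comp = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xd, md, vd in zip(x, mu, var):
                ll += -0.5 * (math.log(2 * math.pi * vd) + (xd - md) ** 2 / vd)
            comp.append(ll)
        # log-sum-exp over mixture components for numerical stability
        m = max(comp)
        total += m + math.log(sum(math.exp(c - m) for c in comp))
    return total / len(frames)

def late_fusion(score_sets, fusion_weights):
    """Assumed fusion rule: weighted sum of per-emotion scores across classifiers."""
    emotions = score_sets[0].keys()
    return {e: sum(w * s[e] for w, s in zip(fusion_weights, score_sets))
            for e in emotions}

def classify(frames_per_stream, models_per_stream, fusion_weights):
    """Score each stream against every emotion GMM, fuse, and pick the best emotion."""
    score_sets = [
        {e: gmm_log_likelihood(frames, *params) for e, params in models.items()}
        for frames, models in zip(frames_per_stream, models_per_stream)
    ]
    fused = late_fusion(score_sets, fusion_weights)
    return max(fused, key=fused.get), fused

# Hypothetical toy models: (weights, means, variances) per emotion, per stream.
stream_a_models = {
    "happy": ([1.0], [[1.0]], [[1.0]]),
    "sad":   ([1.0], [[-1.0]], [[1.0]]),
}
stream_b_models = {
    "happy": ([0.5, 0.5], [[2.0], [3.0]], [[1.0], [1.0]]),
    "sad":   ([0.5, 0.5], [[0.0], [-1.0]], [[1.0], [1.0]]),
}

# Hypothetical test utterance: one short frame sequence per feature stream.
label, fused = classify(
    [[[0.8], [1.2]], [[1.9], [2.1]]],
    [stream_a_models, stream_b_models],
    fusion_weights=[0.5, 0.5],
)
print(label, fused)
```

In a real system the GMM parameters would be estimated from training data for each emotion, and the fusion weights would be tuned on held-out data; here they are fixed by hand only to keep the sketch self-contained.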

Author information

Authors and Affiliations

KDDI Research, Inc., 2-1-15 Ohara, Fujimino-shi, Saitama, 356-8502, Japan
Panikos Heracleous, Akio Ishikawa, Keiji Yasuda, Hiroyuki Kawashima, Fumiaki Sugaya & Masayuki Hashimoto

Corresponding author

Correspondence to Panikos Heracleous.

Editor information

Editors and Affiliations

CIC, Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Heracleous, P., Ishikawa, A., Yasuda, K., Kawashima, H., Sugaya, F., Hashimoto, M. (2018). Machine Learning Approaches for Speech Emotion Recognition: Classic and Novel Advances. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol 10762. Springer, Cham. https://doi.org/10.1007/978-3-319-77116-8_14

DOI: https://doi.org/10.1007/978-3-319-77116-8_14
Published: 10 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77115-1
Online ISBN: 978-3-319-77116-8
eBook Packages: Computer Science, Computer Science (R0)
