Abstract
Automatic emotion recognition is a challenging task that can have a great impact on improving natural human-computer interaction. In dyadic human-human interactions, a more complex interaction scenario, a person's emotional state is influenced by the interlocutor's behaviors, such as speaking style/prosody, speech content, facial expressions, and body language. Mutual influence, i.e., a person's influence on the interacting partner's behaviors in a dialog, has been shown in previous work to be important for predicting that person's emotional state. In this paper, we propose several multimodal interaction strategies that imitate the interactive patterns of real scenarios in order to explore the effect of mutual influence on continuous emotion prediction. Our experiments are based on the Audio/Visual Emotion Challenge (AVEC) 2017 dataset for continuous emotion prediction, and the results show that our proposed multimodal interaction strategy achieves 3.82% and 3.26% absolute improvements on arousal and valence, respectively. Additionally, we analyse how the correlation between interactive pairs affects prediction and find that pairs with strong correlation significantly outperform pairs with weak correlation on both arousal and valence.
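The abstract does not detail the interaction strategies themselves, so the following is only a minimal sketch of one plausible strategy under stated assumptions: frame-level multimodal features of the target speaker and the interlocutor are concatenated and fed to an LSTM that regresses continuous arousal and valence, trained with a 1 − CCC objective (the concordance correlation coefficient is the standard AVEC evaluation measure). The class name, feature dimensions, layer sizes, and loss choice below are hypothetical illustrations, not the authors' exact design.

```python
# Hedged sketch of a dyadic interaction strategy for continuous emotion
# prediction. All dimensions, names, and the 1 - CCC loss are assumptions
# for illustration; they do not reproduce the paper's actual model.
import torch
import torch.nn as nn


def ccc(pred, gold):
    """Concordance correlation coefficient (the usual AVEC metric)."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    covariance = ((pred - pred_mean) * (gold - gold_mean)).mean()
    return (2 * covariance) / (pred.var() + gold.var()
                               + (pred_mean - gold_mean) ** 2 + 1e-8)


class DyadicEmotionLSTM(nn.Module):
    """Concatenates target and interlocutor features, then regresses
    per-frame arousal/valence with an LSTM (hypothetical architecture)."""

    def __init__(self, target_dim=128, partner_dim=128, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(target_dim + partner_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)          # outputs: arousal, valence

    def forward(self, target_feats, partner_feats):
        # Both inputs: (batch, time, dim) sequences from the two speakers.
        fused = torch.cat([target_feats, partner_feats], dim=-1)
        out, _ = self.lstm(fused)
        return self.head(out)                     # (batch, time, 2)


if __name__ == "__main__":
    model = DyadicEmotionLSTM()
    target = torch.randn(4, 100, 128)             # hypothetical multimodal features
    partner = torch.randn(4, 100, 128)             # interlocutor features
    labels = torch.rand(4, 100, 2)                 # gold arousal/valence traces
    pred = model(target, partner)
    loss = 1 - ccc(pred.reshape(-1), labels.reshape(-1))  # 1 - CCC as training loss
    loss.backward()
    print("1 - CCC loss:", loss.item())
```

In this sketch, the "interaction" is realized as simple feature-level fusion of the two interlocutors' streams; the paper explores several such strategies, and this block is meant only to make the general setup concrete.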
Acknowledgment
This work is supported by the National Key Research and Development Plan under Grant No. 2016YFB1001202 and partially supported by the National Natural Science Foundation of China (Grant No. 61772535). We also appreciate the support from the National Demonstration Center for Experimental Education of Information Technology and Management (Renmin University of China).