Abstract
Automatic emotion recognition is a challenging task that can have a great impact on improving natural human-computer interaction. In dyadic human-human interactions, a more complex interaction scenario, a person's emotional state is influenced by the interlocutor's behaviors, such as speaking style/prosody, speech content, facial expressions, and body language. Mutual influence, i.e., a person's influence on the interacting partner's behaviors in a dialog, has been shown in previous work to be important for predicting that person's emotional state. In this paper, we propose several multimodal interaction strategies that imitate the interactive patterns of real scenarios in order to explore the effect of mutual influence on continuous emotion prediction. Our experiments are based on the Audio/Visual Emotion Challenge (AVEC) 2017 dataset for continuous emotion prediction, and the results show that our proposed multimodal interaction strategy achieves 3.82% and 3.26% absolute improvements on arousal and valence, respectively. Additionally, we analyse how the correlation between interactive pairs affects prediction and find that pairs with strong correlation significantly outperform pairs with weak correlation on both arousal and valence.
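The abstract does not detail the interaction strategies themselves, so the following is only a minimal sketch of one plausible strategy under stated assumptions: frame-level multimodal features of the target speaker and the interlocutor are concatenated and fed to an LSTM that regresses continuous arousal and valence, trained with a 1 − CCC objective (the concordance correlation coefficient is the standard AVEC evaluation measure). The class name, feature dimensions, layer sizes, and loss choice below are hypothetical illustrations, not the authors' exact design.

```python
# Hedged sketch of a dyadic interaction strategy for continuous emotion
# prediction. All dimensions, names, and the 1 - CCC loss are assumptions
# for illustration; they do not reproduce the paper's actual model.
import torch
import torch.nn as nn


def ccc(pred, gold):
    """Concordance correlation coefficient (the usual AVEC metric)."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    covariance = ((pred - pred_mean) * (gold - gold_mean)).mean()
    return (2 * covariance) / (pred.var() + gold.var()
                               + (pred_mean - gold_mean) ** 2 + 1e-8)


class DyadicEmotionLSTM(nn.Module):
    """Concatenates target and interlocutor features, then regresses
    per-frame arousal/valence with an LSTM (hypothetical architecture)."""

    def __init__(self, target_dim=128, partner_dim=128, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(target_dim + partner_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)          # outputs: arousal, valence

    def forward(self, target_feats, partner_feats):
        # Both inputs: (batch, time, dim) sequences from the two speakers.
        fused = torch.cat([target_feats, partner_feats], dim=-1)
        out, _ = self.lstm(fused)
        return self.head(out)                     # (batch, time, 2)


if __name__ == "__main__":
    model = DyadicEmotionLSTM()
    target = torch.randn(4, 100, 128)             # hypothetical multimodal features
    partner = torch.randn(4, 100, 128)             # interlocutor features
    labels = torch.rand(4, 100, 2)                 # gold arousal/valence traces
    pred = model(target, partner)
    loss = 1 - ccc(pred.reshape(-1), labels.reshape(-1))  # 1 - CCC as training loss
    loss.backward()
    print("1 - CCC loss:", loss.item())
```

In this sketch, the "interaction" is realized as simple feature-level fusion of the two interlocutors' streams; the paper explores several such strategies, and this block is meant only to make the general setup concrete.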
Acknowledgment
This work is supported by the National Key Research and Development Plan under Grant No. 2016YFB1001202 and partially supported by the National Natural Science Foundation of China (Grant No. 61772535). We also appreciate the support from the National Demonstration Center for Experimental Education of Information Technology and Management (Renmin University of China).