
Multimodal Dimensional and Continuous Emotion Recognition in Dyadic Video Interactions

  • Conference paper
  • First Online:
Advances in Multimedia Information Processing – PCM 2018 (PCM 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11164)


Abstract

Automatic emotion recognition is a challenging task that can have a great impact on improving natural human-computer interaction. In dyadic human-human interactions, a more complex interaction scenario, a person’s emotional state is influenced by the interlocutor’s behaviors, such as talking style/prosody, speech content, facial expression, and body language. Mutual influence, a person’s influence on the interacting partner’s behaviors in a dialog, has been shown in previous work to be important for predicting that person’s emotional state. In this paper, we propose several multimodal interaction strategies that imitate the interactive patterns of real scenarios in order to explore the effect of mutual influence in continuous emotion prediction tasks. Our experiments are conducted on the Audio/Visual Emotion Challenge (AVEC) 2017 dataset for continuous emotion prediction, and the results show that our proposed multimodal interaction strategy gains 3.82% and 3.26% absolute improvement on arousal and valence respectively. Additionally, we analyse the influence of the correlation between interactive pairs on both arousal and valence. Our experimental results show that interactive pairs with strong correlation significantly outperform pairs with weak correlation on both arousal and valence.
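To make the ideas in the abstract concrete, the sketch below illustrates, in Python with TensorFlow (the toolkit pointed to in the paper's notes), the two ingredients described above: measuring the correlation between an interactive pair's emotion traces, and a sequence regressor that concatenates a speaker's own features with the interlocutor's features to predict a continuous emotion dimension. This is a minimal illustration rather than the authors' implementation; the feature dimensions, network size, and the helper names `pair_correlation` and `build_interaction_model` are placeholder assumptions.

```python
# Minimal, illustrative sketch (not the authors' code) of two ideas from the
# abstract: (1) grouping dyads by the correlation between the interlocutors'
# emotion traces, and (2) predicting a continuous emotion dimension from the
# concatenation of self and partner features with a recurrent model.
# Feature dimensions and network sizes below are placeholder assumptions.
import numpy as np
import tensorflow as tf

def pair_correlation(arousal_a, arousal_b):
    """Pearson correlation between the two interlocutors' arousal traces,
    used here to split dyads into strongly vs. weakly correlated pairs."""
    a = np.asarray(arousal_a, dtype=np.float32)
    b = np.asarray(arousal_b, dtype=np.float32)
    return float(np.corrcoef(a, b)[0, 1])

def build_interaction_model(self_dim=100, partner_dim=100, hidden=64):
    """Sequence regressor over concatenated self + interlocutor features,
    predicting one continuous value (arousal or valence) per frame."""
    inputs = tf.keras.Input(shape=(None, self_dim + partner_dim))
    x = tf.keras.layers.LSTM(hidden, return_sequences=True)(inputs)
    outputs = tf.keras.layers.Dense(1)(x)  # one prediction per time step
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

# Toy usage with random data standing in for AVEC-2017-style per-frame inputs.
T = 200  # number of frames in one recording
feat_self = np.random.rand(1, T, 100).astype(np.float32)
feat_partner = np.random.rand(1, T, 100).astype(np.float32)
arousal_self = np.random.rand(1, T, 1).astype(np.float32)

model = build_interaction_model()
model.fit(np.concatenate([feat_self, feat_partner], axis=-1),
          arousal_self, epochs=1, verbose=0)

r = pair_correlation(np.random.rand(T), np.random.rand(T))
print("pair correlation:", r)
```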


Notes

  1. http://sewaproject.eu.

  2. https://www.tensorflow.org.


Acknowledgment

This work is supported by the National Key Research and Development Plan under Grant No. 2016YFB1001202 and partially supported by the National Natural Science Foundation of China (Grant No. 61772535). We also appreciate the support from the National Demonstration Center for Experimental Education of Information Technology and Management (Renmin University of China).

Author information


Corresponding author

Correspondence to Qin Jin.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Zhao, J., Chen, S., Jin, Q. (2018). Multimodal Dimensional and Continuous Emotion Recognition in Dyadic Video Interactions. In: Hong, R., Cheng, WH., Yamasaki, T., Wang, M., Ngo, CW. (eds) Advances in Multimedia Information Processing – PCM 2018. PCM 2018. Lecture Notes in Computer Science, vol 11164. Springer, Cham. https://doi.org/10.1007/978-3-030-00776-8_28

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-00776-8_28

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00775-1

  • Online ISBN: 978-3-030-00776-8

  • eBook Packages: Computer Science, Computer Science (R0)
