Joint Learning of Speech-Driven Facial Motion with Bidirectional Long-Short Term Memory

  • Conference paper
  • Conference: Intelligent Virtual Agents (IVA 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10498)

Abstract

The face conveys a blend of verbal and nonverbal information that plays an important role in daily interaction. While speech articulation mostly affects the orofacial areas, emotional behaviors are externalized across the entire face. Considering the relation between verbal and nonverbal behaviors is important to create naturalistic facial movements for conversational agents (CAs). Furthermore, facial muscles connect areas across the face, creating principled relationships and dependencies between movements that have to be taken into account. These relationships are ignored when movements in different facial regions are generated separately. This paper proposes speech-driven models that jointly capture not only the relationship between speech and facial movements, but also the relationships across facial movements. The inputs to the models are features extracted from speech that convey the verbal and emotional states of the speakers. We build our models with bidirectional long-short term memory (BLSTM) units, which have been shown to be very successful in modeling dependencies in sequential data. The objective and subjective evaluations of the results demonstrate the benefits of jointly modeling facial regions with this framework.
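To make the modeling idea concrete, the sketch below shows how a BLSTM can map a sequence of per-frame speech features to facial motion parameters, with all facial regions predicted jointly by a single output layer. This is a minimal illustration in PyTorch, not the authors' implementation; the feature dimensions, layer sizes, and the class name JointBLSTMFaceModel are hypothetical.

    import torch
    import torch.nn as nn

    class JointBLSTMFaceModel(nn.Module):
        """Hypothetical sketch: BLSTM mapping per-frame speech features to
        facial motion parameters for all facial regions at once."""

        def __init__(self, speech_dim=25, hidden_dim=256, face_dim=50,
                     num_layers=2, dropout=0.5):
            super().__init__()
            # Bidirectional LSTM over the speech feature sequence, so each
            # frame's prediction sees both past and future acoustic context.
            self.blstm = nn.LSTM(
                input_size=speech_dim,
                hidden_size=hidden_dim,
                num_layers=num_layers,
                batch_first=True,
                bidirectional=True,
                dropout=dropout,
            )
            # One linear layer predicts every facial parameter jointly,
            # rather than training a separate model per facial region.
            self.out = nn.Linear(2 * hidden_dim, face_dim)

        def forward(self, speech_feats):
            # speech_feats: (batch, frames, speech_dim)
            hidden, _ = self.blstm(speech_feats)   # (batch, frames, 2*hidden_dim)
            return self.out(hidden)                # (batch, frames, face_dim)

    # Example: 8 utterances, 200 frames each, 25-dim acoustic features per frame.
    model = JointBLSTMFaceModel()
    speech = torch.randn(8, 200, 25)
    face_motion = model(speech)  # (8, 200, 50) jointly predicted facial parameters

Because a single output layer reads the same bidirectional hidden state for every facial parameter, correlations across facial regions can be captured implicitly, in contrast to generating each region with an independent model.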



Author information

Corresponding author

Correspondence to Najmeh Sadoughi.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Sadoughi, N., Busso, C. (2017). Joint Learning of Speech-Driven Facial Motion with Bidirectional Long-Short Term Memory. In: Beskow, J., Peters, C., Castellano, G., O'Sullivan, C., Leite, I., Kopp, S. (eds) Intelligent Virtual Agents. IVA 2017. Lecture Notes in Computer Science, vol 10498. Springer, Cham. https://doi.org/10.1007/978-3-319-67401-8_49

  • DOI: https://doi.org/10.1007/978-3-319-67401-8_49

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67400-1

  • Online ISBN: 978-3-319-67401-8
