Joint Learning of Speech-Driven Facial Motion with Bidirectional Long-Short Term Memory

  • Conference paper
  • Conference: Intelligent Virtual Agents (IVA 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10498)

Abstract

The face conveys a blend of verbal and nonverbal information that plays an important role in daily interaction. While speech articulation mostly affects the orofacial areas, emotional behaviors are externalized across the entire face. Considering the relation between verbal and nonverbal behaviors is important to create naturalistic facial movements for conversational agents (CAs). Furthermore, facial muscles connect areas across the face, creating principled relationships and dependencies between movements that have to be taken into account. These relationships are ignored when movements in different facial regions are generated separately. This paper proposes speech-driven models that jointly capture not only the relationship between speech and facial movements, but also the relationships across facial movements. The inputs to the models are features extracted from speech that convey the verbal and emotional states of the speakers. We build our models with bidirectional long-short term memory (BLSTM) units, which have been shown to be very successful in modeling dependencies in sequential data. The objective and subjective evaluations of the results demonstrate the benefits of jointly modeling facial regions with this framework.
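To make the modeling idea concrete, the sketch below shows how a BLSTM can map a sequence of per-frame speech features to facial motion parameters, with all facial regions predicted jointly by a single output layer. This is a minimal illustration in PyTorch, not the authors' implementation; the feature dimensions, layer sizes, and the class name JointBLSTMFaceModel are hypothetical.

    import torch
    import torch.nn as nn

    class JointBLSTMFaceModel(nn.Module):
        """Hypothetical sketch: BLSTM mapping per-frame speech features to
        facial motion parameters for all facial regions at once."""

        def __init__(self, speech_dim=25, hidden_dim=256, face_dim=50,
                     num_layers=2, dropout=0.5):
            super().__init__()
            # Bidirectional LSTM over the speech feature sequence, so each
            # frame's prediction sees both past and future acoustic context.
            self.blstm = nn.LSTM(
                input_size=speech_dim,
                hidden_size=hidden_dim,
                num_layers=num_layers,
                batch_first=True,
                bidirectional=True,
                dropout=dropout,
            )
            # One linear layer predicts every facial parameter jointly,
            # rather than training a separate model per facial region.
            self.out = nn.Linear(2 * hidden_dim, face_dim)

        def forward(self, speech_feats):
            # speech_feats: (batch, frames, speech_dim)
            hidden, _ = self.blstm(speech_feats)   # (batch, frames, 2*hidden_dim)
            return self.out(hidden)                # (batch, frames, face_dim)

    # Example: 8 utterances, 200 frames each, 25-dim acoustic features per frame.
    model = JointBLSTMFaceModel()
    speech = torch.randn(8, 200, 25)
    face_motion = model(speech)  # (8, 200, 50) jointly predicted facial parameters

Because a single output layer reads the same bidirectional hidden state for every facial parameter, correlations across facial regions can be captured implicitly, in contrast to generating each region with an independent model.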



Author information

Corresponding author

Correspondence to Najmeh Sadoughi.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Sadoughi, N., Busso, C. (2017). Joint Learning of Speech-Driven Facial Motion with Bidirectional Long-Short Term Memory. In: Beskow, J., Peters, C., Castellano, G., O'Sullivan, C., Leite, I., Kopp, S. (eds) Intelligent Virtual Agents. IVA 2017. Lecture Notes in Computer Science, vol 10498. Springer, Cham. https://doi.org/10.1007/978-3-319-67401-8_49

  • DOI: https://doi.org/10.1007/978-3-319-67401-8_49

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67400-1

  • Online ISBN: 978-3-319-67401-8
