
1 Introduction

The last decade has witnessed a tremendous development of interaction devices for different applications in the entertainment field [1], such as interactive video games, immersive environments, and virtual and augmented reality, among others [2,3,4,5]. One of the most relevant fields of study regarding human-computer interaction is affective computing, which aims to determine human emotional states in order to improve the interaction with machines. Emotional states have been studied from different information sources, such as biological signals [6], speech signals [7], and still images and video sequences [8, 9]. However, the use of visual features of facial expressions is the most explored modality in the literature [10, 11].

Emotion recognition based on facial expressions attempts to identify and characterize changes in motion units (in this case, those concerning facial muscles), which allows representing the emotional expression of a person [12]. In this sense, two main strategies have been pursued: the first one focuses on the characteristics of the motion units that delineate facial patterns, which are detected in still images [13], and the second one attempts to describe the way in which those expressions change across image sequences [14]. Despite the high performance reported by some works that use still images, modeling the dynamics of facial expressions has been shown to achieve better performance [15]. However, the development of visual descriptors that represent discriminative characteristics associated with different emotions remains an open research challenge [16].

On the other hand, the recent development of high-performance specialized processing units (such as GPUs) has increased the use of deep learning techniques based on artificial neural networks. Since 2012, when deep learning approaches outperformed classical computer vision algorithms on a very challenging task, i.e., the classification of 15 million images within 22,000 categories [17], this strategy has been used to solve problems in many applications such as speech recognition, image classification, and sequence modeling, among others. Specifically, the use of convolutional layers in network architectures has proven useful for extracting high-level spatial image representations without requiring classical preprocessing algorithms for image feature extraction.

In this work, we present an effective and robust emotion recognition system based on the analysis of video sequences, which allows analyzing movement patterns of facial expressions over short time periods. To do so, a deep learning architecture was designed to process video sequences and to discriminate among six emotions. Convolutional layers are used to extract high-level representations of facial expressions across frames, and recurrent layers are used to model how the spatial characteristics extracted by the convolutional layers change over time. Then, a multilayer perceptron classifies the sequence into one of the six possible emotions. The performance was evaluated using the well-known eNTERFACE database, which was processed to obtain a broad set of training and evaluation data. Thus, video sequences were split into one-second segments, according to the frames-per-second acquisition parameter.

This paper is organized as follows: Sect. 2 presents a brief summary of related works described in the literature that use the eNTERFACE database. Section 3 describes the proposed spatiotemporal modeling based on deep neural network architectures; it also describes the video preprocessing stage and the specific aspects of the proposed neural network architecture, including the number of layers, their main parameters, and the stochastic optimizer. In Sect. 4 the experimental results are presented, and finally, in Sect. 5, the conclusions and future work are discussed.

2 Previous Works

Strategies for emotion recognition in affective computing models have been an active research field in recent years, because of the large number of applications in which emotions can affect human behavior and actions. The state of the art presents several strategies for recognizing emotions [18]. However, those based on the analysis of facial expressions are among the most widely used, and they still face important challenges such as illumination changes, pose variations (in still images [19]), and spatiotemporal modeling (in video or image sequences [20]). A comprehensive review of proposed approaches can be consulted in [21, 22]. In this section, the most recently published works aiming to recognize emotions in the eNTERFACE'05 database are presented to provide a comparison baseline.

Dobrišek et al. [23] present an emotion recognition framework for video sequences composed of different subsystems: the first subsystem implements a preprocessing stage, in which the face is automatically detected and the image pixels are normalized using histogram equalization. In the second subsystem, a low-dimensional subspace of the preprocessed images is built to construct a prototypical template for each of the N emotions, by means of a simple eigen-decomposition of each emotion scatter matrix. In the third subsystem, a matching stage compares the encoded face representations using canonical correlation analysis computed via singular value decomposition (SVD). The performance obtained with this strategy is highly dependent on the dimensionality of the subspaces created for encoding the images, which increases the computational complexity of the algorithm.

Zhalehpour et al. [24] propose a strategy to select the frames of the original video in which the emotional expressiveness reaches its peak; those frames are then processed as still images. The authors achieve this by grouping all the video frames into K clusters based on the dissimilarity between the local phase quantization (LPQ) features of each frame. They use a complete-link agglomerative clustering algorithm, named the dendrogram clustering algorithm (DEND-CLUSTER), to group the frames and then select the frame whose average distance from the rest of the frames in the cluster is minimum. They then compute an ideal selection measure (ISM) score based on the gradient of each pixel to arrange the selected frames in descending order. The peak-frame features are used to train an SVM with a linear kernel to classify emotions. This approach does not consider the dynamic motion of the face, which is an expected condition in a real context.

Poria et al. [25] propose a model for emotion classification from visual data using an extreme learning machine (ELM). In their strategy, they first manually annotate every frame in the video in order to remove frames with no relevant emotional content (neutral). Then, they extract relevant facial points (eyes, mouth, nose, eyebrows, and chin) using the Luxand FSDK software to obtain a feature vector per frame. The feature vector for each video is then obtained by coordinate-wise averaging of the feature vectors of the individual frames. Finally, they use the ELM to classify the video clips. Although the authors achieved a high performance with their strategy (\(81.21\%\)), the main drawback lies in the manual annotation of the frames, because it might cause a significant loss of frames with emotional content relevant to the recognition.

Other works, such as [26, 27], propose emotion recognition systems that achieve significant performance. However, their limitation lies in the small number of samples available to the classifiers, which biases the generalization of the models. In [28], a model using big data towards 5G is presented. This model includes different datasets for the emotion classification task and achieves a significant performance compared with other state-of-the-art strategies, reaching accuracies between \(40.78\%\) (disgust) and \(76.9\%\) (happiness).

3 Materials and Methods

Figure 1 illustrates a graphical overview of the proposed approach, which is mainly divided into three stages. The first one, data preparation, applies some preprocessing steps to the original data in the database so that they can be used for training and evaluating a deep-learning-based classification model. The second one trains the deep learning classification model based on the spatio-temporal analysis of the data, which aims to recognize emotions. The last one, the testing stage, uses the trained model to decide the emotion of a new video (not used in the training stage).

Fig. 1.

Graphical overview of the proposed approach. In the training stage, image sequences extracted from the database are used for training a deep learning model composed of a dense convolutional layer, a dense GRU layer, and a multilayer perceptron. In the testing stage, the trained model is used for predicting the emotion of a new image sequence.

3.1 Data Preparation

With the aim of providing an experimental evaluation comparable with previous works, the proposed approach was evaluated using a publicly available and widely used dataset, the eNTERFACE'05 database [29]. It is a bimodal database, which contains video and audio signals of subjects expressing affective sentences in English. The videos were acquired using a camera at a 25 fps sampling rate. Forty-three different non-professional actors (35 men and 8 women) from 14 different countries were asked to express six different emotions (anger, fear, disgust, happiness, surprise, and sadness) through five specific sentences per emotion. The database is therefore composed of a total of 1290 bimodal samples, corresponding to five sentences per emotion (six emotions) for each of the 43 subjects. At this point, the audio signals were separated from the video frames and then discarded. Some of the frames extracted from these image sequences are shown in Fig. 2.

Thus, 1290 image sequences were extracted and processed as follows in order to prepare the data for the experimental stage. Initially, a grayscale transformation was applied; then, the original samples were processed to obtain a dataset in which each instance contained the same number of frames, which is required by the deep learning model. Here, one-second subsamples were generated by splitting the original videos without overlap. Additionally, because deep learning strategies require a large amount of data to achieve optimal training parameters, a data augmentation strategy was applied by flipping every frame of each sample vertically. Finally, four subjects in the database were discarded due to a lack of samples. Thus, a new experimental dataset was generated, composed of 5444 image sequences of 25 frames each.
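As a concrete illustration of this preparation step, the following Python sketch reads one video, converts its frames to grayscale, splits them into non-overlapping one-second clips of 25 frames, and adds a flipped copy of each clip for augmentation. The frame size, the helper name, and the use of OpenCV are assumptions made for illustration and are not taken from the original implementation.

```python
# Minimal data-preparation sketch (assumed frame size and helper names are
# illustrative, not from the original work).
import cv2
import numpy as np

FRAMES_PER_CLIP = 25  # 25 fps -> one-second clips


def video_to_clips(video_path, frame_size=(48, 48)):
    """Return a list of (25, H, W) grayscale clips extracted from one video."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frames.append(cv2.resize(gray, frame_size))
    capture.release()

    clips = []
    # Split into non-overlapping one-second subsamples, dropping the remainder.
    for start in range(0, len(frames) - FRAMES_PER_CLIP + 1, FRAMES_PER_CLIP):
        clip = np.stack(frames[start:start + FRAMES_PER_CLIP])
        clips.append(clip)
        clips.append(np.flip(clip, axis=1))  # vertically flipped copy (augmentation)
    return clips
```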

Fig. 2.

Frame samples for different emotional states of three subjects, taken from videos of the eNTERFACE database

3.2 Deep Learning Based Emotion Recognition from Image Sequences

Spatial and temporal relations in the video samples described in Subsect. 3.1 were modeled using a deep learning architecture that combines dense convolutional layers (for spatial information analysis) and recurrent units (for temporal modeling of the data). The dense convolutional layer extracts spatial features from every sample. Given that each sample can be regarded as a sequence of still images, the convolutional layer learns the main movement patterns across frames when emotions are expressed and propagates them through the network layers. The dense convolutional layer contains 768 convolutional filters of size \(3\times 3\), with full padding and a stride of \(2\times 2\). To maximize the extraction of spatial information in the network units during the training stage, the rectified linear unit (ReLU) [30] was selected as the activation function for the convolutional layer, since it has been shown to generate more expressive models than classical activation functions [31, 32].

After the convolutional layer, a dense recurrent layer was included to model the temporal dimension of the video frames, so that the spatial information extracted from the convolutional layer is modeled as a sequence of temporal data that changes across time. To do so, a layer of gated recurrent units (GRU cells) was implemented. A GRU layer was preferred over long short-term memory units (LSTM cells, widely used in sequential models) because GRU cells modulate the flow of information inside the unit without having separate memory cells, which makes them computationally more efficient. Besides, these units mitigate the vanishing gradient problem, since the cell can bypass multiple time steps, allowing the error to back-propagate more easily [33]. The dense GRU layer contains 512 units with backward sequence processing (i.e., the cells process the sequence backward and the output is then reversed), and a gradient clipping value of 1 to keep the gradients bounded during training.

Finally, a multilayer perceptron (MLP) was implemented to perform the classification task. The MLP is composed of two hidden layers with 64 units each and ReLU activation functions. The output layer contains as many neurons as emotions, with a softmax activation function, which yields a probability distribution over the classes. The proposed approach is graphically illustrated in Fig. 1.
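The following Keras sketch shows one possible realization of the described architecture (a convolutional layer with 768 filters of \(3\times 3\) and stride \(2\times 2\), a backward GRU layer with 512 units, and a two-layer MLP with softmax output). The input frame size, the use of 'same' padding in place of full padding, and the placement of the gradient clipping on the optimizer are assumptions made for illustration; they are not confirmed by the original work.

```python
# Minimal Keras sketch of the described architecture; assumed details are
# marked in the comments below.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_EMOTIONS = 6
FRAMES, HEIGHT, WIDTH = 25, 48, 48  # assumed frame size

model = models.Sequential([
    tf.keras.Input(shape=(FRAMES, HEIGHT, WIDTH, 1)),
    # Dense convolutional layer: 768 filters of 3x3, stride 2x2, ReLU,
    # applied to every frame of the one-second clip ('same' padding assumed).
    layers.TimeDistributed(
        layers.Conv2D(768, (3, 3), strides=(2, 2),
                      padding="same", activation="relu")),
    layers.TimeDistributed(layers.Flatten()),
    # Dense GRU layer with 512 units processing the sequence backward.
    layers.GRU(512, go_backwards=True),
    # Multilayer perceptron classifier with softmax output.
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_EMOTIONS, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(
        learning_rate=0.001, beta_1=0.9, beta_2=0.999,
        epsilon=1e-8, clipvalue=1.0),  # clipping placed on the optimizer (assumption)
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```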

3.3 Adaptive Moment Estimation (Adam) Optimizer

The network parameters were optimized during the training stage with a stochastic-gradient-descent-based algorithm, Adaptive Moment Estimation (Adam) [34]. The Adam optimizer updates the parameters based on moment estimates of the gradients (first moment, the mean, and second moment, the uncentered variance). This technique performs adaptive per-parameter tuning (as Adagrad [35] and RMSprop [36] do), while also taking into account the initialization of the gradients and small decay rates during the training stage. Both considerations significantly improve parameter optimization, increase performance, and help avoid divergence of the network. The Adam moment estimates (mean and variance) and the update rule are detailed in Eqs. (1), (2), and (3), respectively, where the constant values used during the training stage were set to \(\beta _1 = 0.9\), \(\beta _2 = 0.999\), \(\alpha = 0.001\), and \(\varepsilon = 10^{-8}\) (\(\alpha\) being the learning rate).

$$\begin{aligned} m_t = \beta _1 m_{t-1} + (1 - \beta _1)g_t \end{aligned}$$
(1)
$$\begin{aligned} v_t = \beta _2 v_{t-1} + (1 - \beta _2)g^2 _t \end{aligned}$$
(2)
$$\begin{aligned} \theta _{t+1} = \theta _t - \frac{\alpha }{\sqrt{v_t} + \varepsilon } m_t \end{aligned}$$
(3)
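For clarity, the following NumPy sketch applies Eqs. (1)-(3) as a single parameter update; note that the canonical Adam algorithm additionally applies bias-correction terms to \(m_t\) and \(v_t\), which are omitted here to mirror the equations above.

```python
# Sketch of the update rules in Eqs. (1)-(3); bias correction is omitted
# to mirror the equations as written.
import numpy as np

def adam_step(theta, grad, m, v, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad              # Eq. (1): first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2         # Eq. (2): second moment (variance)
    theta = theta - alpha * m / (np.sqrt(v) + eps)  # Eq. (3): parameter update
    return theta, m, v
```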

4 Performance Evaluation

4.1 Experimental Setup

The proposed approach was evaluated in two scenarios: a gender-dependent case and a gender-independent case. In the gender-dependent case, the learning model was trained using samples from subjects of the same gender, i.e., males and females separately. In the gender-independent case, all samples were merged into a single dataset.

Accordingly, in the gender-independent case the experimental dataset was composed of 5444 samples (as described in Sect. 3.1); for each fold, 4355 samples were used for training and 1089 for testing the model. In the gender-dependent case, two experiments were performed: male-dependent and female-dependent. The male-dependent experiment considered the 31 male subjects found in the database, from which 4366 samples were extracted, 3492 for training and 874 for testing. The female-dependent experiment considered the 8 female subjects found in the database, from which 1220 samples were extracted, 976 for training and 244 for testing.

In both cases, the evaluation was performed using an iterative 5-fold cross-validation strategy. For each fold, a learning model was trained for 50 epochs, obtaining a testing accuracy (Eq. 4) and a confusion matrix. Then, the mean over the five folds was computed and reported as the overall accuracy and overall confusion matrix.

$$\begin{aligned} Accuracy = \frac{Correct \,\, predicted \,\, emotions \,\, (hits)}{Amount \,\, of \,\, samples} \end{aligned}$$
(4)
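As an illustration of this evaluation protocol, the sketch below runs a 5-fold cross-validation and reports the mean testing accuracy; build_model stands for the architecture sketched in Sect. 3.2, and the variable names are hypothetical.

```python
# Sketch of the iterative 5-fold evaluation (build_model and the variable
# names are placeholders, not part of the original work).
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, build_model, epochs=50, folds=5):
    accuracies = []
    for train_idx, test_idx in KFold(n_splits=folds, shuffle=True).split(X):
        model = build_model()
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        accuracies.append(acc)
    return np.mean(accuracies)  # reported as the overall accuracy
```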

4.2 Experimental Results

Gender-Dependent Case. Overall accuracies of 0.8275 and 0.8079 were obtained for the female-dependent and male-dependent cases, respectively. Performance differences should be read with caution because of the different numbers of samples used for training each learning model. Tables 1 and 2 report the confusion matrices for the female-dependent and male-dependent cases, respectively. In the male-dependent case, similar results were obtained for all the emotions; only surprise shows a lower performance (0.7086). On the other hand, in the female-dependent case, larger differences were found: sadness and surprise are clearly more distinguishable than disgust and fear. In this case, fear is confused with disgust, while disgust is mostly confused with anger. This could be expected, considering that female and male facial characteristics differ from one another.

Table 1. Confusion matrix for emotion recognition in the female-dependent case.
Table 2. Confusion matrix for emotion recognition in the male-dependent case.

Gender-Independent Case. In the experiment where the gender of the subjects was not considered, the proposed approach obtained an overall accuracy of 0.8184. The confusion matrix is shown in Table 3. In this case, the emotions with the highest recognition rates were disgust and surprise, with 0.8620 and 0.8410 respectively, while fear and anger were the most confused, with recognition rates of 0.7796 and 0.7881, respectively.

Table 3. Confusion matrix for emotion recognition in the gender-independent case.

In order to compare the experimental results with those found in the state of the art, Table 4 shows the most relevant studies on facial emotion recognition using the eNTERFACE database, as specified in Sect. 2. As can be observed, the proposed approach outperforms the reported results for emotion recognition in a gender-independent classification.

Table 4. Performance comparison with related works in the state of the art.

5 Conclusions and Future Work

In this work, we propose an effective framework for emotion recognition from video, which exploits convolutional and recurrent layers in a deep learning architecture. We proposed a video emotion recognition approach based on dynamic spatio-temporal modeling with deep neural networks, which allows distinguishing between six different emotions (anger, disgust, fear, happiness, sadness, and surprise). The learning model was designed with three main components: the first was a dense convolutional layer specially designed to perform spatial analysis of the video frames; the second was a dense GRU layer, which models the temporal changes of the spatial information extracted by the first one; and the last was a dense multilayer perceptron, which performs the classification task. The performance was evaluated in gender-dependent and gender-independent cases using the eNTERFACE database, which enables a comparison with state-of-the-art methods. According to the results, the proposed approach outperforms previous works in the gender-independent case. Previous works reporting gender-dependent evaluations were not found.

As future work, the real effect of training gender-specific classification models needs to be evaluated; for this, a more homogeneous database is required. Additionally, other aspects such as language, country of origin, and age could also be considered. On the other hand, we intend to explore the use of our strategy in real-time applications using virtual gadgets and avatars for emotion recognition. Finally, we intend to continue exploring spatio-temporal modeling with deep learning approaches for multimodal information fusion analysis.