
1 Introduction

The last decade has witnessed a tremendous development of interaction devices for different applications in the entertainment field [1], such as interactive video games, immersive environments, and virtual and augmented reality, among others [2,3,4,5]. One of the most relevant fields of study regarding human-computer interaction is affective computing, which aims to determine human emotional states in order to improve the interaction with machines. Emotional states have been studied from different information sources, such as biological signals [6], speech signals [7], and still images and video sequences [8, 9]. However, the use of visual features of facial expressions is the most explored modality in the literature [10, 11].

Emotion recognition based on facial expressions attempts to identify and characterize changes in motion units (in this case, those concerning facial muscles), which allows representing the emotional expression of a person [12]. In this sense, two main strategies have been pursued: the first one focuses on the characteristics of the motion units that delineate facial patterns, which are detected in still images [13], and the second one attempts to describe the way in which those expressions change across image sequences [14]. Despite the high performance reported by some works that use still images, modeling the dynamics of facial expressions has been shown to achieve better performance [15]. However, the development of visual descriptors that represent discriminative characteristics associated with different emotions remains an open research challenge [16].

On the other hand, the recent development of high-performance specialized processing units (such as GPUs) has increased the use of deep learning techniques based on artificial neural networks. Since 2012, when deep learning approaches outperformed classical computer vision algorithms on a very challenging task, i.e., the classification of 15 million images within 22,000 categories [17], this strategy has been used to solve problems in many applications such as speech recognition, image classification, and sequence modeling, among others. Specifically, the use of convolutional layers in network architectures has proven useful for extracting high-level spatial image representations without requiring classical preprocessing algorithms for image feature extraction.

In this work, we present an effective and robust emotion recognition system based on the analysis of video sequences, which allows analyzing movement patterns of facial expressions over short time periods. To do so, a deep learning architecture was designed to process video sequences and to discriminate among six emotions. Convolutional layers are used to extract high-level representations of facial expressions across frames, and recurrent layers are used to model how the spatial characteristics extracted by the convolutional layers change over time. Then, a multilayer perceptron classifies the sequence into one of the six possible emotions. The performance was evaluated using the well-known eNTERFACE database, which was processed to obtain a broad set of training and evaluation data. Thus, video sequences were split into one-second segments, according to the frames-per-second acquisition parameter.

This paper is organized as follows: Sect. 2 presents a brief summary of related works described in the literature that use the eNTERFACE database. Section 3 describes the proposed spatiotemporal modeling based on deep neural network architectures; it also describes the video preprocessing stage and the specific aspects of the proposed neural network architecture, including the number of layers, their main parameters, and the stochastic optimizer. In Sect. 4 the experimental results are presented, and finally, in Sect. 5, the conclusions and future work are discussed.

2 Previous Works

Strategies for emotion recognition in affective computing models have been an active research field in recent years, because of the large number of applications in which emotions can affect human behavior and actions. The state of the art presents several strategies for recognizing emotions [18]. However, those based on the analysis of facial expressions are among the most widely used, and they still face important challenges such as illumination changes, pose variations (in still images [19]), and spatiotemporal modeling (in video or image sequences [20]). A comprehensive review of proposed approaches can be consulted in [21, 22]. In this section, the most recently published works aiming to recognize emotions in the eNTERFACE'05 database are presented to provide a comparison baseline.

Dobrišek et al. [23] present an emotion recognition framework for video sequences composed of different subsystems: the first subsystem implements a preprocessing stage, in which the face is automatically detected and the image pixels are normalized using histogram equalization. In the second subsystem, a low-dimensional subspace of the preprocessed images is built to construct a prototypical template for each of the N emotions, by means of a simple eigen-decomposition of each emotion scatter matrix. In the third subsystem, a matching stage compares the encoded face representations using canonical correlation analysis computed via singular value decomposition (SVD). The performance obtained with this strategy is highly dependent on the dimensionality of the subspaces created for encoding the images, which increases the computational complexity of the algorithm.

Zhalehpour et al. [24] propose a strategy to select the frames of the original video in which the emotional expressiveness reaches its peak; those frames are then processed as still images. The authors achieve this by grouping all the video frames into K clusters based on the dissimilarity between the local phase quantization (LPQ) features of each frame. They use a complete-link agglomerative clustering algorithm, named the dendrogram clustering algorithm (DEND-CLUSTER), to group the frames and then select the frame whose average distance from the rest of the frames in the cluster is minimum. They then compute an ideal selection measure (ISM) score based on the gradient of each pixel to arrange the selected frames in descending order. The peak-frame features are used to train an SVM with a linear kernel to classify emotions. This approach does not consider the dynamic motion of the face, which is an expected condition in a real context.

Poria et al. [25] propose a model for emotion classification from visual data using an extreme learning machine (ELM). In their strategy, they first manually annotate every frame in the video in order to remove frames with no relevant emotional content (neutral). Then, they extract relevant facial points (eyes, mouth, nose, eyebrows, and chin) using the Luxand FSDK software to obtain a feature vector per frame. The feature vector for each video is then obtained by coordinate-wise averaging of the feature vectors of the individual frames. Finally, they use the ELM to classify the video clips. Although the authors achieved a high performance with their strategy (\(81.21\%\)), the main drawback lies in the manual annotation of the frames, because it might cause a significant loss of frames with emotional content relevant to the recognition.

Other works, such as [26, 27], propose emotion recognition systems that achieve significant performance. However, their limitation lies in the small number of samples available to the classifiers, which biases the generalization of the models. In [28], a model using big data towards 5G is presented. This model includes different datasets for the emotion classification task and achieves a significant performance compared with other state-of-the-art strategies, reaching accuracies between \(40.78\%\) (disgust) and \(76.9\%\) (happiness).

3 Materials and Methods

Figure 1 illustrates a graphical overview of the proposed approach, which is mainly divided into three stages. The first one, data preparation, applies some preprocessing steps to the original data in the database so that they can be used for training and evaluating a deep-learning-based classification model. The second one trains the deep learning classification model based on the spatio-temporal analysis of the data, which aims to recognize emotions. The last one, the testing stage, uses the trained model to decide the emotion of a new video (not used in the training stage).

Fig. 1.

Graphical overview of the proposed approach. In the training stage, image sequences extracted from the database are used for training a deep learning model composed of a dense convolutional layer, a dense GRU layer, and a multilayer perceptron. In the testing stage, the trained model is used for predicting the emotion of a new image sequence.

3.1 Data Preparation

With the aim of providing an experimental evaluation comparable with previous works, the proposed approach was evaluated using a publicly available and widely used dataset, the eNTERFACE'05 database [29]. It is a bimodal database, which contains video and audio signals of subjects expressing affective sentences in English. The videos were acquired using a camera at a 25 fps sampling rate. Forty-three different non-professional actors (35 men and 8 women) from 14 different countries were asked to express six different emotions (anger, fear, disgust, happiness, surprise, and sadness) through five specific sentences per emotion. The database is therefore composed of a total of 1290 bimodal samples, corresponding to five sentences per emotion (six emotions) for each of the 43 subjects. At this point, the audio signals were separated from the video frames and then discarded. Some of the frames extracted from these image sequences are shown in Fig. 2.

Thus, 1290 image sequences were extracted and processed as follows in order to prepare the data for the experimental stage. Initially, a grayscale transformation was applied; then, the original samples were processed to obtain a dataset in which each instance contained the same number of frames, which is required by the deep learning model. Here, one-second subsamples were generated by splitting the original videos without overlap. Additionally, because deep learning strategies require a large amount of data to achieve optimal training parameters, a data augmentation strategy was applied by flipping every frame of each sample vertically. Finally, four subjects in the database were discarded due to a lack of samples. Thus, a new experimental dataset was generated, composed of 5444 image sequences of 25 frames each.
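As a concrete illustration of this preparation step, the following Python sketch reads one video, converts its frames to grayscale, splits them into non-overlapping one-second clips of 25 frames, and adds a flipped copy of each clip for augmentation. The frame size, the helper name, and the use of OpenCV are assumptions made for illustration and are not taken from the original implementation.

```python
# Minimal data-preparation sketch (assumed frame size and helper names are
# illustrative, not from the original work).
import cv2
import numpy as np

FRAMES_PER_CLIP = 25  # 25 fps -> one-second clips


def video_to_clips(video_path, frame_size=(48, 48)):
    """Return a list of (25, H, W) grayscale clips extracted from one video."""
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frames.append(cv2.resize(gray, frame_size))
    capture.release()

    clips = []
    # Split into non-overlapping one-second subsamples, dropping the remainder.
    for start in range(0, len(frames) - FRAMES_PER_CLIP + 1, FRAMES_PER_CLIP):
        clip = np.stack(frames[start:start + FRAMES_PER_CLIP])
        clips.append(clip)
        clips.append(np.flip(clip, axis=1))  # vertically flipped copy (augmentation)
    return clips
```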

Fig. 2.

Frame samples for different emotional states of three subjects, taken from videos of the eNTERFACE database

3.2 Deep Learning Based Emotion Recognition from Image Sequences

Spatial and temporal relations in the video samples described in Subsect. 3.1 were modeled using a deep learning architecture that combines dense convolutional layers (for spatial information analysis) and recurrent units (for temporal modeling of the data). The dense convolutional layer extracts spatial features from every sample. Given that each sample can be regarded as a sequence of still images, the convolutional layer learns the main movement patterns across frames when emotions are expressed and propagates them through the network layers. The dense convolutional layer contains 768 convolutional filters of size \(3\times 3\), with full padding and a stride of \(2\times 2\). To maximize the extraction of spatial information in the network units during the training stage, the rectified linear unit (ReLU) [30] was selected as the activation function for the convolutional layer, since it has been shown to generate more expressive models than classical activation functions [31, 32].

After the convolutional layer, a dense recurrent layer was included to model the temporal dimension of the video frames, so that the spatial information extracted from the convolutional layer is modeled as a sequence of temporal data that changes across time. To do so, a layer of gated recurrent units (GRU cells) was implemented. A GRU layer was preferred over long short-term memory units (LSTM cells, widely used in sequential models) because GRU cells modulate the flow of information inside the unit without having separate memory cells, which makes them computationally more efficient. Besides, these units mitigate the vanishing gradient problem, since the cell can bypass multiple time steps, allowing the error to back-propagate more easily [33]. The dense GRU layer contains 512 units with backward sequence processing (i.e., the cells process the sequence backward and the output is then reversed), and a gradient clipping value of 1 to keep the gradients bounded during training.

Finally, a multilayer perceptron (MLP) was implemented to perform the classification task. The MLP is composed of two hidden layers with 64 units each and ReLU activation functions. The output layer contains as many neurons as emotions, with a softmax activation function, which yields a probability distribution over the classes. The proposed approach is graphically illustrated in Fig. 1.
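The following Keras sketch shows one possible realization of the described architecture (a convolutional layer with 768 filters of \(3\times 3\) and stride \(2\times 2\), a backward GRU layer with 512 units, and a two-layer MLP with softmax output). The input frame size, the use of 'same' padding in place of full padding, and the placement of the gradient clipping on the optimizer are assumptions made for illustration; they are not confirmed by the original work.

```python
# Minimal Keras sketch of the described architecture; assumed details are
# marked in the comments below.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_EMOTIONS = 6
FRAMES, HEIGHT, WIDTH = 25, 48, 48  # assumed frame size

model = models.Sequential([
    tf.keras.Input(shape=(FRAMES, HEIGHT, WIDTH, 1)),
    # Dense convolutional layer: 768 filters of 3x3, stride 2x2, ReLU,
    # applied to every frame of the one-second clip ('same' padding assumed).
    layers.TimeDistributed(
        layers.Conv2D(768, (3, 3), strides=(2, 2),
                      padding="same", activation="relu")),
    layers.TimeDistributed(layers.Flatten()),
    # Dense GRU layer with 512 units processing the sequence backward.
    layers.GRU(512, go_backwards=True),
    # Multilayer perceptron classifier with softmax output.
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_EMOTIONS, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(
        learning_rate=0.001, beta_1=0.9, beta_2=0.999,
        epsilon=1e-8, clipvalue=1.0),  # clipping placed on the optimizer (assumption)
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```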

3.3 Adaptive Moment Estimation (Adam) Optimizer

The network parameters were optimized during the training stage with a stochastic-gradient-descent-based algorithm, Adaptive Moment Estimation (Adam) [34]. The Adam optimizer updates the parameters based on moment estimates of the gradients (first moment, the mean, and second moment, the uncentered variance). This technique performs adaptive per-parameter tuning (as Adagrad [35] and RMSprop [36] do), while also taking into account the initialization of the gradients and small decay rates during the training stage. Both considerations significantly improve parameter optimization, increase performance, and help avoid divergence of the network. The Adam moment estimates (mean and variance) and the update rule are detailed in Eqs. (1), (2), and (3), respectively, where the constant values used during the training stage were set to \(\beta _1 = 0.9\), \(\beta _2 = 0.999\), \(\alpha = 0.001\), and \(\varepsilon = 10^{-8}\) (\(\alpha\) being the learning rate).

$$\begin{aligned} m_t = \beta _1 m_{t-1} + (1 - \beta _1)g_t \end{aligned}$$
(1)
$$\begin{aligned} v_t = \beta _2 v_{t-1} + (1 - \beta _2)g^2 _t \end{aligned}$$
(2)
$$\begin{aligned} \theta _{t+1} = \theta _t - \frac{\alpha }{\sqrt{v_t} + \varepsilon } m_t \end{aligned}$$
(3)
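For clarity, the following NumPy sketch applies Eqs. (1)-(3) as a single parameter update; note that the canonical Adam algorithm additionally applies bias-correction terms to \(m_t\) and \(v_t\), which are omitted here to mirror the equations above.

```python
# Sketch of the update rules in Eqs. (1)-(3); bias correction is omitted
# to mirror the equations as written.
import numpy as np

def adam_step(theta, grad, m, v, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad              # Eq. (1): first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2         # Eq. (2): second moment (variance)
    theta = theta - alpha * m / (np.sqrt(v) + eps)  # Eq. (3): parameter update
    return theta, m, v
```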

4 Performance Evaluation

4.1 Experimental Setup

The proposed approach was evaluated in two scenarios: a gender-dependent case and a gender-independent case. In the gender-dependent case, the learning model was trained using samples from subjects of the same gender, i.e., males and females separately. In the gender-independent case, all samples were merged into a single dataset.

Accordingly, in the gender-independent case the experimental dataset was composed of 5444 samples (as described in Sect. 3.1); for each fold, 4355 samples were used for training and 1089 for testing the model. In the gender-dependent case, two experiments were performed: male-dependent and female-dependent. The male-dependent experiment considered the 31 male subjects found in the database, from which 4366 samples were extracted, 3492 for training and 874 for testing. The female-dependent experiment considered the 8 female subjects found in the database, from which 1220 samples were extracted, 976 for training and 244 for testing.

In both cases, the evaluation was performed using an iterative 5-fold cross-validation strategy. For each fold, a learning model was trained for 50 epochs, obtaining a testing accuracy (Eq. 4) and a confusion matrix. Then, the mean over the five folds was computed and reported as the overall accuracy and overall confusion matrix.

$$\begin{aligned} Accuracy = \frac{Correct \,\, predicted \,\, emotions \,\, (hits)}{Amount \,\, of \,\, samples} \end{aligned}$$
(4)
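As an illustration of this evaluation protocol, the sketch below runs a 5-fold cross-validation and reports the mean testing accuracy; build_model stands for the architecture sketched in Sect. 3.2, and the variable names are hypothetical.

```python
# Sketch of the iterative 5-fold evaluation (build_model and the variable
# names are placeholders, not part of the original work).
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, build_model, epochs=50, folds=5):
    accuracies = []
    for train_idx, test_idx in KFold(n_splits=folds, shuffle=True).split(X):
        model = build_model()
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        accuracies.append(acc)
    return np.mean(accuracies)  # reported as the overall accuracy
```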

4.2 Experimental Results

Gender-Dependent Case. Overall accuracies of 0.8275 and 0.8079 were obtained for the female-dependent and male-dependent cases, respectively. Performance differences should be read with caution because of the different numbers of samples used for training each learning model. Tables 1 and 2 report the confusion matrices for the female-dependent and male-dependent cases, respectively. In the male-dependent case, similar results were obtained for all the emotions; only surprise shows a lower performance (0.7086). On the other hand, in the female-dependent case, larger differences were found: sadness and surprise are clearly more distinguishable than disgust and fear. In this case, fear is confused with disgust, while disgust is mostly confused with anger. This could be expected, considering that female and male facial characteristics differ from one another.

Table 1. Confusion matrix for emotion recognition in the female-dependent case.
Table 2. Confusion matrix for emotion recognition in the male-dependent case.

Gender-Independent Case. In the experiment where the gender of the subjects was not considered, the proposed approach obtained an overall accuracy of 0.8184. The confusion matrix is shown in Table 3. In this case, the emotions with the highest recognition rates were disgust and surprise, with 0.8620 and 0.8410 respectively, while fear and anger were the most confused, with recognition rates of 0.7796 and 0.7881, respectively.

Table 3. Confusion matrix for emotion recognition in the gender-independent case.

In order to compare the experimental results with those found in the state of the art, Table 4 shows the most relevant studies on facial emotion recognition using the eNTERFACE database, as specified in Sect. 2. As can be observed, the proposed approach outperforms the reported results for emotion recognition in a gender-independent classification.

Table 4. Performance comparison with related works in the state of the art.

5 Conclusions and Future Work

In this work, we propose an effective framework for emotion recognition from video, which exploits convolutional and recurrent layers in a deep learning architecture. We proposed a video emotion recognition approach based on dynamic spatio-temporal modeling with deep neural networks, which allows distinguishing between six different emotions (anger, disgust, fear, happiness, sadness, and surprise). The learning model was designed with three main components: the first was a dense convolutional layer specially designed to perform spatial analysis of the video frames; the second was a dense GRU layer, which models the temporal changes of the spatial information extracted by the first one; and the last was a dense multilayer perceptron, which performs the classification task. The performance was evaluated in gender-dependent and gender-independent cases using the eNTERFACE database, which enables a comparison with state-of-the-art methods. According to the results, the proposed approach outperforms previous works in the gender-independent case. Previous works reporting gender-dependent evaluations were not found.

As future work, the real effect of training gender-specific classification models needs to be evaluated; for this, a more homogeneous database is required. Additionally, other aspects such as language, country of origin, and age could also be considered. On the other hand, we intend to explore the use of our strategy in real-time applications using virtual gadgets and avatars for emotion recognition. Finally, we intend to continue exploring spatio-temporal modeling with deep learning approaches for multimodal information fusion analysis.