
1 Introduction

Speech has been central to human interaction from the earliest ages to the present day. Beyond speech, facial expressions are also important in communication, because how a person is affected by a conversation can be understood from his or her gestures and expressions. With the development of technology, Facial Expression Recognition (FER) has become one of the important research areas in Human-Computer Interaction. Facial expressions are hard for computers to recognize, and the task becomes harder with differences between facial images such as skin color, hair type, age, gender, and each person's response to the same feeling. Furthermore, illumination changes, image resolution and acquisition difficulties complicate the problem even more.

Facial expressions are created by the movement of certain muscles in the face. By detecting these changes, facial expressions can be recognized. Most FER systems recognize 7 classes of emotion: happiness, sadness, surprise, anger, disgust, fear and contempt.

1.1 Related Work

Recognition of facial expressions is one of the important research areas in human-computer interaction. There are many studies in the literature on automatic recognition of facial expressions using different methods. In previous studies, many feature extraction and machine learning methods have been tested on the FER problem.

Deng et al. aimed to recognize facial expressions with a system based on Gabor features using a new Local Gabor Filter Bank [1]. When extracting Gabor features, a Global Gabor Filter Bank with 5 frequencies and 8 orientations is used. To compute the performance of the Local Gabor Filter Bank method, they use Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), applied as a two-stage compression scheme, to compress and select the Gabor features. A minimum distance classifier is then included in the system to measure its facial expression recognition performance.

The Local Binary Pattern (LBP) method can be used to recognize facial expressions. The operator used in the LBP method takes the center pixel value of a \(3 \times 3\) window as a threshold and performs a labeling process (assigning 1 or 0) by comparing all the neighboring pixels with this value [2]. This process is repeated for every pixel in the image, so a new binary image is obtained. The resulting 8-bit binary number is converted to a decimal number, which is assigned as the new value of the center pixel. Histograms are then generated from the new image, and recognition is performed using these histograms as descriptors. The LBP method can be used effectively in person-independent facial expression recognition. Thus, the literature contains several extensions, such as the Boosted-LBP method, a slightly improved version of the original LBP [3]. That study also used LBP as a representation based on statistical local properties of facial expressions. In order to reveal the most discriminative LBP features, the Boosted-LBP method was formulated; with it, facial expressions were classified with Support Vector Classifiers (SVC) and the performance of the system improved. In another study, the Compound Local Binary Patterns (CLBP) method was applied to the FER problem [4]. In the original LBP method, the operator encodes the sign of the difference between the gray values of the center pixel and its P neighboring pixels with P bits; in the proposed CLBP method, encoding is performed using 2P bits. The additional P bits code the amplitude of the difference between the gray values of the center pixel and the given number of neighboring pixels against a threshold value. The method discards some neighborhood information to reduce the length of the feature vector. In order to include locational information in the histogram up to a certain point, a modified CLBP histogram is used in the method. In that study, recognition was implemented with Support Vector Machines (SVM) as the classification algorithm.
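The basic \(3 \times 3\) LBP operator described above can be sketched in a few lines of pure Python. This is an illustrative implementation; the clockwise neighbor ordering and bit assignment are a common convention, not fixed by the cited papers, and border pixels are simply skipped here rather than padded.

```python
def lbp_code(image, i, j):
    """8-bit LBP code for the pixel at (i, j).

    image: list of lists of gray values. Each neighbor is labeled 1
    if it is >= the center value (the threshold), else 0.
    """
    center = image[i][j]
    # Clockwise neighbor offsets starting from the top-left corner
    # (an illustrative ordering; conventions vary between papers).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (di, dj) in enumerate(offsets):
        if image[i + di][j + dj] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for i in range(1, len(image) - 1):
        for j in range(1, len(image[0]) - 1):
            hist[lbp_code(image, i, j)] += 1
    return hist
```

Descriptors such as Boosted-LBP and CLBP build on exactly this kind of per-pixel code before pooling it into regional histograms.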

Another study worked on the Local Directional Pattern (LDP), a feature descriptor for the FER problem [5]. LDP features are obtained by calculating the edge response values at each pixel position in eight directions around that pixel, and then deriving a code based on the relative magnitudes of these responses. Each expression is therefore represented as a distribution of LDP codes. Template Matching and Support Vector Machines have been used to measure the success of the method. With LDP histograms, detailed information such as edges, corners and local texture properties can be obtained. However, since histograms are calculated over the whole image, position information for micro-patterns cannot be kept in a single histogram. To retain this information, the images are generally divided into sub-regions, and position information is included in the overall histogram by computing histograms over those regions.

Sariyanidi et al. used Quantized Local Zernike Moments (QLZM) and Local Zernike Moments (LZM) in their FER study [6]. A set of images is generated for each input image by computing Zernike moments locally around each pixel. Once these images are generated, the expression image is divided into \(N \times N\) subregions and the LZM method is applied to each of the subregions. Then, the real and imaginary parts of each generated ZM coefficient are converted into binary values using the signum function; this conversion carries out the quantization step. With this approach, low-level features are obtained from the LZM calculations, non-linear coding is realized via quantization, and the encoded features are combined in local histograms. The mentioned work achieved very high success on the database used for facial expression recognition.

Another study on recognition of facial expressions is based on the Local Directional Number Pattern method [7]. In this study, the directional information of the textures on the face is coded. The structure of each micro-pattern is computed with a mask, and this information is coded using the significant direction indices (directional numbers) and signs. The face is divided into many sub-regions and the distribution of the Local Directional Number Pattern features is obtained from these sub-regions. All features are then concatenated to create a feature vector, which is used as a face descriptor.

By introducing slackness into the parallel-hyperplane constraints, a method called Modified Correlation Filters (MCF) has been proposed [8]. This method is inspired by Support Vector Machines and correlation filters, and is described as a supervised binary classification algorithm. The MCF method performs energy minimization under linear constraints; its usage reduces the effect of outliers and noise in the training set. The study is applied to the recognition of facial expressions.

2 Feature Extraction Methods

In this section, the methods used in this study are described: Global Zernike Moments (GZM), Local Zernike Moments (LZM) and Local XOR Patterns (LXP).

2.1 Global Zernike Moments

Moment-based approaches are frequently used in pattern recognition and image processing studies. The location, size, position and orientation information of an object are very important in pattern recognition, and such problems can be handled easily with moment-based approaches [9].

Zernike moments (ZM) are one of the methods used in image processing and computer vision problems. ZMs produce successful results in research areas such as character recognition [10], fingerprint recognition [11] and pattern analysis [12]. ZMs are invariant to rotation and can be used as a successful feature extraction method [13] (Fig. 1).

Calculation of Global Zernike Moments. Zernike moments can be defined as projections onto complex Zernike polynomials, which form an orthogonal polynomial set over the unit disk. The Zernike moments of a gray-level image \(f(\rho ,\theta )\) are defined as [14]

$$\begin{aligned} Z_{nm} = \frac{n+1}{\pi }\int \limits _{\theta =0}^{2\pi }\int \limits _{\rho =0}^1 V_{nm}^*(\rho ,\theta )f(\rho ,\theta )\rho d\rho d\theta , \quad |\rho |\le 1. \end{aligned}$$
(1)

where \(V_{nm}(\rho ,\theta )\) are the Zernike polynomials, n is the moment degree and m is the repetition. The \(V_{nm}(\rho ,\theta )\) polynomials are calculated as

$$\begin{aligned} V_{nm}(\rho ,\theta )=R_{nm}(\rho )e^{-jm\theta } \end{aligned}$$
(2)

where \(R_{nm}(\rho )\) are the radial polynomials, calculated as

$$\begin{aligned} R_{nm}(\rho )=\sum _{k=0}^{\frac{n-|m|}{2}}(-1)^k \frac{(n-k)!}{k!\left( \frac{n+|m|}{2}-k\right) !\left( \frac{n-|m|}{2}-k\right) !}\rho ^{n-2k} \end{aligned}$$
(3)

In these equations, the values of n and m are selected according to the rules \(n\ge 0\), \(0\le |m|\le n\), and \(n-|m|\) even.

The Zernike moments in Eq. (1) cannot be applied directly to digital images; the integral must be converted to a discrete form. The discrete equations can be written as

$$\begin{aligned} Z_{nm}=\lambda (n,N)\sum _{i=0}^{N-1}\sum _{j=0}^{N-1}R_{nm}(\rho _{ij})e^{-jm\theta _{ij}}f(i,j), \quad 0\le \rho _{ij} \le 1 \end{aligned}$$
(4)
$$\begin{aligned} \rho =\sqrt{(c_1i+c_2)^2 + (c_1j+c_2)^2} \end{aligned}$$
(5)
$$\begin{aligned} \theta = \tan ^{-1}\left( \frac{c_1j+c_2}{c_1i+c_2}\right) \end{aligned}$$
(6)
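The radial polynomial of Eq. (3) and the discrete moment of Eq. (4) can be sketched as follows. This is an illustrative pure-Python implementation: the mapping of pixel indices into the unit disk is one common choice (the text does not fix the constants \(c_1\), \(c_2\)), and the normalization \(\lambda (n,N)\) is taken here as \((n+1)/\pi \), matching the continuous form in Eq. (1); exact values may therefore differ from a particular paper's convention.

```python
import cmath
import math


def radial_poly(n, m, rho):
    """Zernike radial polynomial R_nm(rho), Eq. (3)."""
    m = abs(m)
    total = 0.0
    for k in range((n - m) // 2 + 1):
        total += ((-1) ** k * math.factorial(n - k)
                  / (math.factorial(k)
                     * math.factorial((n + m) // 2 - k)
                     * math.factorial((n - m) // 2 - k))) * rho ** (n - 2 * k)
    return total


def zernike_moment(img, n, m):
    """Discrete Zernike moment Z_nm of a square gray image, Eq. (4).

    One common unit-disk mapping is used for (rho, theta); pixels that
    fall outside the disk are skipped.
    """
    N = len(img)
    z = 0.0 + 0.0j
    for i in range(N):
        for j in range(N):
            x = (2 * j - N + 1) / N  # illustrative choice of c1, c2
            y = (2 * i - N + 1) / N
            rho = math.hypot(x, y)
            if rho > 1.0:
                continue
            theta = math.atan2(y, x)
            z += img[i][j] * radial_poly(n, m, rho) * cmath.exp(-1j * m * theta)
    return (n + 1) / math.pi * z
```

For a constant image, all moments with \(m \ne 0\) vanish by symmetry, which is a quick sanity check on the implementation.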
Fig. 1. Images of the Zernike polynomials corresponding to \(n=0,1,2,3,4\) [15].

2.2 Local Zernike Moments

When image moments are applied globally, they give successful results in classification or recognition tasks where the shapes in the images are clear [9]. Zernike moments, one such family, are based on calculating complex moment coefficients from the whole image; this global form is successful in tasks such as character recognition.

Zernike moments give a holistic description of the whole image. These descriptions arise from the use of radial polynomials at different degrees; each parameter brings out a different characteristic feature of the image [16]. It has been observed that these holistic moment components are inadequate for face images, where the local features of the image matter more. Thus, by localizing the Zernike moments method, that is, by calculating a new moment value around each pixel, the Local Zernike Moments (LZM) representation was proposed, and successful face recognition results have been obtained with it [9]. The results obtained by applying Global Zernike Moments and Local Zernike Moments are shown in Fig. 2.

Fig. 2. The application of the GZM and LZM methods.

In facial expression recognition, local changes in the face image play a very important role. In this study, the LZM method was applied twice in succession: the first application yields local features that are independent of illumination changes, and the second captures shape statistics.

While global Zernike moments use all pixel values of the image at once, the Local Zernike Moments method performs these calculations around each pixel of the image. This localization is used to extract the shape information of the image pattern.

Calculation of Local Zernike Moments. In the LZM formulation, the moment-based operators \(V_{nm}^k\) are calculated as

$$\begin{aligned} \varvec{V}_{nm}^k(i,j)=V_{nm}(\rho _{ij},\theta _{ij}) \end{aligned}$$
(7)

The \(k \times k\) kernels can be considered as 2D convolution filters used in image filtering. Using these kernels, the LZM transformation is defined as

$$\begin{aligned} \varvec{Z}_{nm}^k(i,j)=\sum _{p,q=-\frac{k-1}{2}}^{\frac{k-1}{2}} f(i-p,j-q)\varvec{V}_{nm}^k(p,q) \end{aligned}$$
(8)

There are several parameters in the LZM method. One of the most important is n, the moment degree; different numbers of real and imaginary \(V_{nm}^k\) kernels are obtained depending on it. The real and imaginary components of the first 8 kernels obtained for \(k = 9\) are shown in Fig. 3.

Fig. 3. (a) The real components of the first 8 kernels obtained for \(k=9\); (b) the imaginary components of the first 8 kernels obtained for \(k=9\) [17].

As seen from Eq. 8, real and imaginary images are produced corresponding to the number of real and imaginary \(V_{nm}^k\) kernels used. In this study, we discarded the images produced by the \(m=0\) filters, since the imaginary parts of those images are 0.

The number of complex valued images obtained by the LZM transformation can be calculated as:

$$\begin{aligned} K(n)=\bigg \{\begin{array}{ll} \frac{n(n+2)}{4} &{} \quad \text {n is even},\\ \frac{(n+1)^2}{4} &{} \quad \text {n is odd.} \end{array} \end{aligned}$$
(9)
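Eq. (9) can be checked with a one-line helper. For instance, \(n=3\) yields 4 complex images, matching the 4 real and 4 imaginary first-layer images reported later for \(n=3\).

```python
def num_complex_images(n):
    """Number of complex-valued LZM images for moment degree n, Eq. (9)."""
    if n % 2 == 0:
        return n * (n + 2) // 4
    return (n + 1) ** 2 // 4
```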

In many face recognition applications, the advantages of dividing images into subregions have been discussed. Therefore, the LZM images are divided into subregions in two stages. In the first stage, the images are divided into \(N \times N\) sub-regions of equal size. In the second stage, the images are divided into \((N-1) \times (N-1)\) sub-regions by a grid shifted by half a cell, so \(N^2 + (N-1)^2\) is the total number of subregions for each image.
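The two-stage subregion scheme can be sketched as below. This is an illustrative helper (region bounds returned as floating-point coordinates), not the paper's exact implementation; its region count is \(N^2\) cells from the regular grid plus \((N-1)^2\) from the half-cell-shifted grid.

```python
def subregion_bounds(width, height, N):
    """(x0, y0, x1, y1) bounds of the N x N grid plus the
    half-cell-shifted (N-1) x (N-1) grid over a width x height image."""
    cell_w, cell_h = width / N, height / N
    regions = []
    # First stage: N x N equal cells.
    for r in range(N):
        for c in range(N):
            regions.append((c * cell_w, r * cell_h,
                            (c + 1) * cell_w, (r + 1) * cell_h))
    # Second stage: same cell size, shifted half a cell in both axes.
    for r in range(N - 1):
        for c in range(N - 1):
            x0 = (c + 0.5) * cell_w
            y0 = (r + 0.5) * cell_h
            regions.append((x0, y0, x0 + cell_w, y0 + cell_h))
    return regions
```

Histograms (or other descriptors) computed per region and concatenated then retain coarse position information that a single global histogram would lose.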

2.3 Local XOR Patterns

The original study on Local XOR Patterns defined them as Local Gabor XOR Patterns [18]. The proposed method aims to reduce the sensitivity of Gabor phases to different poses. If two phases fall in the same range (e.g. [\(0^{\circ }\), \(90^{\circ }\))), they are considered to have similar local characteristics, whereas phase values that do not fall within the same range reflect completely different characteristics.

Calculation of Local XOR Patterns. First, in the LXP method, new pixel values are obtained by quantizing the phase values in the [\(0^{\circ }\), \(360^{\circ }\)) range into different intervals. If the number of intervals is 4, values in [\(0^{\circ }\), \(90^{\circ }\)) become 0; [\(90^{\circ }\), \(180^{\circ }\)): 1; [\(180^{\circ }\), \(270^{\circ }\)): 2; [\(270^{\circ }\), \(360^{\circ }\)): 3. After quantization, the value of the center pixel and the values of its neighboring pixels are subjected to an XOR operation: if a neighboring pixel has the same quantized value as the center pixel, it is labeled 0; otherwise it is labeled 1. The result is a binary pattern, which is then converted to decimal form. The corresponding formula is

$$\begin{aligned} \begin{aligned} \varvec{LGXP}_{\mu ,v}(z_{c}) = [{LGXP}_{\mu ,v}^P,{LGXP}_{\mu ,v}^{P-1},\ldots ,{LGXP}_{\mu ,v}^1]_{binary}\\ = \Big [\sum _{i=1}^{P} 2^{i-1}\cdot {LGXP}_{\mu ,v}^i\Big ]_{decimal}. \end{aligned} \end{aligned}$$
(10)

The processing steps of the LXP method are shown in Fig. 4.

Fig. 4. Example of the LXP method with 4 phase ranges: quantization and the resulting binary and decimal numbers [18].

where \(z_c\) is the center pixel position in the phase map, \(\mu \) is the orientation and v is the scale. \({LGXP}_{\mu ,v}^i\ (i=1,2,\ldots ,P)\) is the pattern computed between \(z_c\) and its neighbor \(z_i\), calculated as

$$\begin{aligned} \varvec{LGXP}_{(\mu ,v)}^i=q(\phi _{(\mu ,v)} (z_c ) ) \bigotimes q(\phi _{(\mu ,v)} (z_i ) ), i=1,2,\ldots ,P. \end{aligned}$$
(11)

where \(\phi _{\mu ,v}(\cdot )\) is the phase, \(\otimes \) denotes the XOR operation, and \(q(\cdot )\) denotes the quantization. They are defined as

$$\begin{aligned} c \bigotimes d=\bigg \{\begin{array}{ll} 0,\qquad \text {c = d ,} \\ 1, \qquad \text {c} \ne \text {d .} \end{array} \end{aligned}$$
(12)
$$\begin{aligned} q(\phi _{\mu ,v}(\cdot ))= i, \quad \frac{360\cdot i}{b} \leqslant \phi _{\mu ,v}(\cdot ) < \frac{360\cdot (i+1)}{b}, \quad i=0,1,\ldots ,b-1 \end{aligned}$$
(13)

where b is the number of phase ranges.
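A minimal sketch of the quantization (Eq. 13) and the XOR comparison (Eqs. 11-12) on an 8-neighborhood follows. The clockwise neighbor ordering and bit positions are an illustrative convention, since the formulation leaves them to the implementer.

```python
def quantize_phase(phase_deg, b=4):
    """Map a phase in degrees to one of b ranges, as in Eq. (13)."""
    return int(phase_deg % 360 // (360 / b))


def lxp_code(phase_map, i, j, b=4):
    """8-neighbor Local XOR Pattern code for the pixel at (i, j).

    phase_map: list of lists of phase values in degrees.
    A neighbor contributes 1 if its quantized value differs from the
    quantized center value (the XOR of Eq. 12), else 0.
    """
    center = quantize_phase(phase_map[i][j], b)
    # Clockwise offsets from the top-left corner (illustrative order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (di, dj) in enumerate(offsets):
        neighbor = quantize_phase(phase_map[i + di][j + dj], b)
        if neighbor != center:
            code |= 1 << bit
    return code
```

With \(b = 4\), two phases XOR to 0 exactly when they fall in the same \(90^{\circ }\) range, matching the intervals listed above.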

3 Experimental Studies

3.1 Database

The Cohn-Kanade (CK) database is one of the most widely used databases in automatic facial expression recognition studies [19]. In this study, we use the latest version, the CK+ (Extended CK) database [20]. It contains 327 facial expression image sequences of 118 different subjects; each sequence belongs to one person and runs from the neutral expression to the peak expression. In the database, there are a total of 7 expression classes: anger, disgust, fear, happiness, surprise, sadness and contempt. Images of these classes are shown in Fig. 5. The distribution of expressions across the classes is given in Table 1.

Fig. 5. Expression images in the dataset.

3.2 Preprocessing on Images

Facial expression images must be processed before classification. In this study, as a result of the tests carried out, we used the peak image of each expression sequence, so the recognition system operates on a single still image. As a pre-processing step, we select regions of the face that are very important for recognition while the expression occurs, such as the eyes and the mouth. In this study, 9 patches were specified on each face image and all patches were aligned. The specified patches on an image are shown in Fig. 6, and the patches extracted according to the active appearance model points are shown in Fig. 7.

Table 1. Distribution of the expressions according to the classes.
Fig. 6. Specified patches on a face image.

Fig. 7. The extracted patches from the image.

3.3 Generating Feature Vectors

There are 3 steps for generating the feature vectors.

  • New images are generated with the application of LZM.

  • The binary images are obtained by applying the LXP method to the real and imaginary parts of the images generated by LZM.

  • GZM approach is applied to these final images and features are obtained.

Application of Local Zernike Moments. LZM has previously been applied to the FER problem [21]. In that work, very high system performance could not be achieved by applying LZM directly to the whole image, but the results were promising. Therefore, in this study, Local Zernike Moments are applied to each patch. With LZM, a new moment image with the same size as the original is obtained by computing, for each pixel, a value that takes into account the values of its adjacent pixels. This process is repeated for each moment degree to obtain a set of images. The real and imaginary parts of the images obtained by applying LZM once can be seen in Fig. 8.

Fig. 8. The real and imaginary parts of the images obtained by applying LZM once; real images are shown in the first row, imaginary images in the second row [21].

As a result of all the tests performed, it was seen that applying the LZM method twice improves the performance of the system. Accordingly, \(n = 3\), \(m = 3\), \(k_1 = 5\), \(k_2 = 5\) were selected as the most suitable parameter values. With these parameters, 36 moment images are obtained from each patch image: for \(n = 3\), \(m = 3\), the first LZM layer produces 4 real and 4 imaginary images, and applying the second LZM layer to the 4 images of the first layer produces 32 more, for a total of 36. For one facial expression, \(36 \times 10 = 360\) images are obtained. In earlier work on facial expression recognition, the phase-amplitude histograms of the LZM images were calculated [21]. However, recognition could not be performed successfully using LZM alone, because computing histograms over facial expressions does not produce very successful results; the loss of position information in the histograms can be considered a disadvantage. For this reason, we propose a cascaded method for the FER problem in this study, consisting of three methods: LZM, LXP and GZM, respectively.

Application of Local XOR Patterns. The 36 images produced for each patch with LZM contain real and imaginary components. The Local XOR Pattern method obtains phase-valued images from these real and imaginary images. On these phase images, binary patterns are obtained by applying the XOR operation between each pixel and its neighbors according to their quantized values. The binary digits, taken in the clockwise direction, are combined into a decimal number, which is assigned as the new center pixel value. This process is repeated for every pixel in the image with \(3 \times 3\) or \(5 \times 5\) filters, according to the neighborhood degree, as in a sliding-window method. Thus, an image with new pixel values is obtained. Images obtained by the LXP method are shown in Fig. 9; in the figure, the LXP method is applied to the output images of the first LZM layer.

Fig. 9. Face images after LXP method applied.

There are 2 important parameters in the LXP method: the phase-range value and the neighborhood value. Both parameters were assigned according to the test results.

Application of Global Zernike Moments. GZM is a method based on calculating complex moment coefficients from the whole image. It is successful on images that contain significant shape information, such as characters. For this reason, instead of applying the method directly to the entire face image, applying it to the patches produces more successful results.

The most important parameter of the GZM method is the moment degree. As this parameter changes, the number of generated moments changes, and with it the dimension of the resulting feature vector. In this study, we set the moment degree to 11.

Furthermore, we analyzed the performance of each patch; the individual patch performances can be seen in the graphic in Fig. 10.

Fig. 10. The graphical representation of each patch success.

3.4 Classification Results

In this system, the LZM, LXP and GZM approaches are applied to each patch. After the GZM method is applied, the moment values of all patches are concatenated.

After the feature vectors are generated, the performance of the system can be tested with classification methods. For this purpose, we applied the LibSVM classification algorithm.

LibSVM. LibSVM, a Support Vector Machine library, is frequently used for multi-class classification problems. In this study, the LibSVM approach was applied to the vectors because we have 7 classes to classify. There are 4 kernel types in this method, and one of them must be selected; therefore, system performance was tested with each kernel. The highest performance is achieved with the linear kernel, as shown in Table 2, so we use LibSVM with the linear kernel.

Table 2. Test results of LibSVM kernels.

We tested all 3 methods independently and obtained the best results with the cascaded structure that includes all 3 of them. The confusion matrix containing the recognition rate for each expression is given in Table 3.

Table 3. Confusion matrix.

Our study is compared with other studies that use the same dataset in Table 4.

Table 4. The comparison of the results in other studies.

3.5 Conclusion and Future Works

In this study, we proposed a new cascaded method consisting of the Local Zernike Moments (LZM), Local XOR Patterns (LXP) and Global Zernike Moments (GZM) methods for the Facial Expression Recognition problem. The most common dataset, CK+, is used; it contains image sequences of 327 expressions from 118 people, covering happiness, sadness, surprise, anger, disgust, fear and contempt. We select the most emoted image of each sequence, apply the methods and obtain the feature vector. As a final step, a LibSVM classifier with a linear kernel is used as the classification algorithm, and the calculated feature vectors are classified with LibSVM under the leave-one-out cross-validation protocol. The facial expression recognition rate of the overall system is measured as 90.34%.
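The leave-one-out protocol used for the reported rate can be sketched as below. The classifier here is a simple 1-nearest-neighbor stand-in used only to make the example self-contained; the actual system uses LibSVM with a linear kernel.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Predict the label of x from its closest training vector
    (a stand-in classifier for illustration; the paper uses LibSVM)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_x)), key=lambda k: sq_dist(train_x[k], x))
    return train_y[best]


def leave_one_out_accuracy(features, labels, predict=nearest_neighbor_predict):
    """Hold out each sample in turn, train on the rest, and report
    the fraction of held-out samples classified correctly."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)
```

Replacing `nearest_neighbor_predict` with a LibSVM train/predict call reproduces the evaluation scheme described in this section.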

We plan to use more than one image from each image sequence and add features describing the differences between those images to our feature vector. We may also try different classification algorithms in a cascaded structure at the classification step. As another future work, histograms of the images obtained after applying LZM and LXP can be calculated and used in the feature vector.