
1 Introduction

Depression is a psychological disorder characterized by low mood and a disinclination towards routine activities, generally persisting for longer than two weeks. It negatively impacts a person's well-being and is known to increase the risk of suicidal tendencies [1]. Of the 300 million people affected by depression globally, 57 million (18%) are in India [18]. Depression may be self-assessed using the Patient Health Questionnaire (PHQ) [13] or may be determined through clinical interviews based on the Hamilton Depression Rating Scale [17]. These methods suffer from subjectivity and bias, as they depend upon the honesty and willingness of the patient during the interaction and the clinician's ability to interpret the subject's responses [25]. According to the WHO, more than half of those affected by depression are misdiagnosed, increasing both false positives and false negatives [1]. Hence, it is desirable to build a decision support system that aids clinicians in making an accurate depression assessment based on objective behavioral markers such as speech properties, facial expressions and body gestures, which are less likely to be suppressed by the patient [10].

The research community has studied the role of non-verbal behavior such as facial emotions, speech and semantic information for depression diagnosis and has established its correlation with depression [4, 5, 15]. Some of the suggested methods for depression detection are unimodal [19, 29], while others are multimodal, combining two or more modalities [5, 15]. Though multimodal systems perform better, they entail high time and space complexity. Moreover, acquiring data from multiple modalities incurs a high cost. Hence, it is desirable to build a unimodal depression detection system that is cost-effective, simple and efficient, both in terms of data acquisition and model building.

Many patients suffering from depression are either unable to articulate their feelings or hesitate to discuss them with the clinician. In such situations, facial expressions, which are used to determine human emotions [24], can be helpful. Mehrabian et al. [14] stated that in day-to-day interaction, facial expressions are responsible for 55% of the total information exchange, while language accounts for only 7%. Hence, in our work, we propose to strengthen the unimodal depression detection system based only on the facial cues obtained from video recordings.

In the literature, various facial features have been extracted and studied for their correlation with depression [21]. Recently, Pampouchidou et al. [20] proposed a novel representation, the Landmark Motion History Image (LMHI), to capture the movement of facial landmarks across video frames. They concluded that the Histogram of Oriented Gradients (HOG) features extracted from the LMHI, termed FaceHOG, are the most relevant features for depression classification, with F1 scores of 0.5 and 0.9 for the depressed and non-depressed categories respectively. The experiments were performed on the Distress Analysis Interview Corpus - Wizard of Oz (DAIC-WOZ) dataset [23]. The high dimensionality of the FaceHOG feature set may be the reason for the low depressed F1 score; this can be handled effectively using feature selection techniques.

To the best of our knowledge, feature selection has not been explored much for depression detection from videos. In our work, we investigate three popular univariate filter feature selection techniques, Fisher Discriminant Ratio (FDR) [8], Mutual Information (MI) [7] and Pearson Correlation (PC) [22], to find a relevant set of FaceHOG features [20]. It is well known that the learning algorithm plays a key role in building a decision model that achieves high performance. Since filter feature selection determines the relevant features independently of the learning algorithm, it becomes important to investigate which combination of univariate feature selection method and learning algorithm provides the maximum performance. Hence, we explore four well-known classifiers: Decision Tree (DT), Linear Discriminant Analysis (LDA), k Nearest Neighbor (KNN) and Support Vector Machine (SVM) [8]. In the literature, many depression detection models are built as regression problems; hence, we also investigate four well-known regression techniques for determining the depression severity level: Decision Tree (DTR), Linear (LR), Partial Least Square (PLSR) and Support Vector (SVR) regression [8]. Through exhaustive experiments, we determine the most suitable feature selection technique, the best performing classification/regression technique, and the best combination of univariate feature selection method and classification/regression technique. Section 2 summarizes the related work, Sect. 3 presents the combination of univariate feature selection and learning algorithms, and Sect. 4 presents the experiments and results. Finally, Sect. 5 concludes our work and gives future directions.

Fig. 1. Landmark Motion History Image of a healthy person with Id 310

Fig. 2. Landmark Motion History Image of a depressed person with Id 321

2 Related Work

Facial cues have been widely studied in correlation with the mental state of a person [9]. Cohn et al. [4] extracted features using facial action units and an Active Appearance Model for the prediction of depression. Meng et al. [15] captured facial dynamics from videos in a Motion History Histogram (MHH) image and computed Local Binary Pattern (LBP) and Edge Oriented Histogram (EOH) features for depression prediction. Cummins et al. [5] extracted Space-Time Interest Points and Pyramid of HOG features from videos to estimate depression. Jan et al. [12] extracted LBP, EOH and Local Phase Quantization (LPQ) features for each video frame and captured their change across frames using a 1-D MHH; the dimensionality of the resulting 1-D MHH features was reduced using Principal Component Analysis. In [6], facial and head movements were computed from the facial landmarks of a video and the Min-Redundancy Max-Relevance method was used to select the relevant features. Nasir et al. [16] computed a polynomial parameterization of the visual features and reduced their dimensionality using Mutual Information Maximization for depression classification. Yang et al. [30] proposed a Histogram of Displacement Range method to compute features from the facial landmarks, and a combination of a Deep Convolutional Neural Network (DCNN) and a deep neural network model was used for depression detection. Pampouchidou et al. [19] compared the performance of the LMHI with the Motion History Image and the Gabor-inhibited LMHI; LBP, HOG and LPQ features were extracted from these images for depression prediction. Zhu et al. [31] extracted appearance features from static video frames and motion features from optical flow images using DCNNs for depression detection. Jazaery et al. [2] used a 3D CNN and a Recurrent Neural Network to learn spatio-temporal features from videos for depression detection.

3 Combination of Univariate Feature Selection and the Learning Algorithms

Pampouchidou et al. [20] extracted many visual feature sets, viz. FaceLBP, FaceHOG, head motion, blinking rate, etc., using the 2D facial landmarks given for each video frame. Following the method of [20], we construct the LMHI of size \(252 \times 248\) pixels, which represents the motion of the 2D facial landmarks across the video frames; Figures 1 and 2 show two example LMHIs. To extract the HOG features, the image is divided into cells of size \(32 \times 32\) pixels and the gradient is computed for each pixel of a cell. A 9-bin histogram is then created to represent the contribution of all the pixels in a cell. To account for illumination variance, histogram normalization is performed over blocks of \(2\times 2\) cells. The cell histograms from all the blocks are concatenated to form the FaceHOG feature set of dimension 1296.

Pampouchidou et al. concluded that the FaceHOG features are the most relevant for classifying depression, with F1 scores of 0.5 and 0.9 for the depressed and non-depressed categories respectively. However, the dimensionality of the FaceHOG features is high in relation to the number of samples in the DAIC-WOZ dataset. This may cause the decision model to overfit [3] and is possibly the reason for the low F1 score in identifying depressed individuals. To circumvent this problem, it is imperative to reduce the dimension of the FaceHOG feature set; using only a relevant subset of the features helps to enhance the performance of the decision system. Several filter and wrapper methods have been suggested in the literature for feature selection [11]. Filter methods are much simpler than wrapper approaches and help build a system that is cost-effective in terms of time and space. To the best of our knowledge, univariate filter feature selection techniques have not been explored much for video-based depression detection. In this work, we determine a relevant subset of the FaceHOG features using three univariate filter feature selection techniques: Fisher Discriminant Ratio (FDR) [8], Mutual Information (MI) [7] and Pearson Correlation (PC) [22]. After obtaining the subset of relevant features, we apply four well-known classification techniques (DT, LDA, KNN and SVM) and four regression techniques (DTR, LR, PLSR and SVR) to determine the best combination of feature selection method and learning algorithm for effective depression identification (classification) and severity estimation (regression) respectively.
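
As an illustration of the FaceHOG computation described at the beginning of this section, the following is a minimal sketch in Python using scikit-image's `hog`; the `lmhi` placeholder array and the `L2-Hys` block normalization are assumptions for illustration, not details taken from [20].

```python
# Minimal sketch of FaceHOG extraction from a precomputed LMHI.
# `lmhi` is a placeholder; in practice it would hold the 252x248 LMHI.
import numpy as np
from skimage.feature import hog

lmhi = np.zeros((252, 248), dtype=np.float32)  # placeholder LMHI

# 32x32-pixel cells, 9-bin orientation histograms, 2x2-cell blocks with
# block-wise normalization, matching the description above.
face_hog = hog(
    lmhi,
    orientations=9,
    pixels_per_cell=(32, 32),
    cells_per_block=(2, 2),
    block_norm='L2-Hys',   # assumed normalization scheme
    feature_vector=True,
)
print(face_hog.shape)  # (1296,) = 6 x 6 blocks x (2 x 2 cells) x 9 bins
```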

4 Experiments and Results

All the experiments have been performed on the DAIC-WOZ dataset [23]. The data had been partitioned into training, development and test sets. Classification labels (depressed or non-depressed) and PHQ-8 scores (for regression) were provided for all but the test set; hence, we train our models on the training set (107 samples) and evaluate them on the development set (35 samples). Each univariate filter feature selection method computes the relevance of each FaceHOG feature with respect to the response variable and ranks the features in descending order of their relevance score. The decision model is then learned incrementally over the ranked features, and the minimum number of features that gives the best performance is finally selected (#).
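
For concreteness, the incremental selection procedure can be sketched as follows (Python/scikit-learn). MI ranking and the LDA classifier are shown as one of the twelve possible combinations, and the variable names `X_train`, `y_train`, `X_dev`, `y_dev` are assumptions, not part of the original description.

```python
# Sketch: rank features by a univariate relevance score (MI shown here),
# refit the classifier on the top-k features for increasing k, and keep
# the smallest k that yields the best F1 (depressed) on the development set.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import f1_score

def incremental_selection(X_train, y_train, X_dev, y_dev):
    scores = mutual_info_classif(X_train, y_train, random_state=0)
    ranking = np.argsort(scores)[::-1]          # descending relevance
    best_f1, best_k = -1.0, 0
    for k in range(1, X_train.shape[1] + 1):
        idx = ranking[:k]
        clf = LinearDiscriminantAnalysis().fit(X_train[:, idx], y_train)
        f1_dep = f1_score(y_dev, clf.predict(X_dev[:, idx]), pos_label=1)
        if f1_dep > best_f1:                    # strictly better keeps k minimal
            best_f1, best_k = f1_dep, k
    return ranking[:best_k], best_f1
```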

The performance of the three feature selection techniques (FDR, MI, PC) in conjunction with the four classification techniques (DT, LDA, KNN, SVM) and the four regression techniques (DTR, LR, PLSR, SVR) is compared in Tables 1 and 2 respectively. Classification performance is reported in terms of the F1 depressed score (F1 dep) and F1 non-depressed score (F1 ndep), and regression performance in terms of Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). For each of the three univariate methods, the minimum number of features (#) for which the maximum performance is achieved (F1 dep for classification, MAE for regression) is recorded. For each learning method, the feature selection technique that gives the best performance is highlighted in bold. The following observations are based on Table 1:

  • Application of feature selection improves the performance of the model, except in the case of MI and PC with SVM.

  • For each classification technique, FDR outperforms MI and PC in terms of F1 dep, F1 ndep and the number of selected features (except for DT).

  • For each feature selection technique, LDA outperforms all other classifiers.

  • FDR followed by LDA gives the best F1 scores (depressed and non-depressed). Both FDR and LDA are based on the Fisher criterion. However, FDR cannot combine features as LDA does, and LDA cannot discard irrelevant features as FDR does. The two therefore complement each other, and their combination gives better performance (a minimal sketch of this pairing is given below).
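
The sketch below illustrates the FDR-followed-by-LDA pairing, assuming the per-feature Fisher Discriminant Ratio is computed as \((\mu_1 - \mu_0)^2 / (\sigma_1^2 + \sigma_0^2)\); the exact variant used in [8] may differ, and the variable names and choice of k are illustrative only.

```python
# Sketch of FDR-based selection followed by an LDA classifier.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fdr_scores(X, y):
    # Per-feature Fisher Discriminant Ratio between the two classes.
    X0, X1 = X[y == 0], X[y == 1]
    num = (X1.mean(axis=0) - X0.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X0.var(axis=0) + 1e-12   # avoid division by zero
    return num / den

def fdr_lda(X_train, y_train, X_dev, k):
    idx = np.argsort(fdr_scores(X_train, y_train))[::-1][:k]  # top-k by FDR
    lda = LinearDiscriminantAnalysis().fit(X_train[:, idx], y_train)
    return lda.predict(X_dev[:, idx])
```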

Table 1. Classification comparison
Table 2. Regression comparison

The following observations are based on Table 2:

  • Use of feature selection improves the performance of all regression methods.

  • FDR outperforms MI and PC in combination with all the regression techniques except with LR.

  • PC-based feature selection followed by LR gives the best performance in terms of MAE and RMSE. PC selects the features that have a high degree of linear correlation with the response variable, and LR models the linear relationship between the features and the response variable; hence, the combination of the two provides better performance (a sketch of this pairing is given below).
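
The sketch below illustrates the PC-followed-by-LR pairing, ranking features by the absolute Pearson correlation with the PHQ-8 score; variable names and the choice of k are illustrative assumptions.

```python
# Sketch of Pearson-correlation-based selection followed by linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

def pc_lr(X_train, y_train, X_dev, y_dev, k):
    # Pearson correlation of each feature column with the response variable.
    Xc = X_train - X_train.mean(axis=0)
    yc = y_train - y_train.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()) + 1e-12)
    idx = np.argsort(np.abs(corr))[::-1][:k]        # top-k by |correlation|
    reg = LinearRegression().fit(X_train[:, idx], y_train)
    pred = reg.predict(X_dev[:, idx])
    return mean_absolute_error(y_dev, pred), pred
```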

Table 3 compares the best results of the proposed combination technique with existing methods for depression detection based on visual cues. The proposed combination of FDR and LDA outperforms all the classification models suggested in the literature. For depression severity estimation, the combination of PC and LR outperforms most of the existing regression models.

Table 3. Comparison of the proposed model with the state-of-the-art

5 Conclusion

Features captured from video data are relevant for depression detection and are a strong contender as a unimodal technique capable of supporting clinicians in monitoring patients and correctly assessing the severity of their condition. Due to the high dimensionality of the features extracted from videos, the decision models built for depression detection are complex and prone to overfitting. To circumvent this, we have employed univariate filter feature selection methods to reduce the dimensionality of the features used to build the depression detection system. Four well-known classifiers and four regression methods have been explored in combination with the feature selection techniques, and the role of feature selection has been emphasized. The relevant features obtained using FDR are transformed by the LDA classifier, making the combination of FDR and LDA the most appropriate for video-based depression classification. For diagnosing depression severity, PC in combination with LR has been found to be the most suitable, as both PC and LR are based on the linear correlation between the features and the response variable. The proposed combinations for classification and regression outperform most of the existing results for video-based depression detection.

Future work will focus on reducing the dimensionality of other visual features obtained from video data and identifying those features that are relevant for depression detection. We will also explore advanced feature selection techniques that not only eliminate features irrelevant to the response variable but also remove redundant/correlated features.