1 Introduction

Motivation. In recent years, voice-activated assistants such as Amazon Echo, Google Home, and Siri have become a prevalent mode of human-machine interaction. While prior findings [20] indicate that human-machine interactions are inherently social, current devices lack any empathic ability. The ability to automatically extract and appropriately respond to the emotion of the speaker would empower a large array of advanced human-centric applications, from customer service [30] and education [19] to health care [24]. As the advertising industry puts it, emotion detection is key to “Capturing the heart” [23] of users and customers. Thus, as machines grow more complex, emotions can no longer be ignored.

Challenges. Emotion detection from voice remains a challenging problem and thus an active area of study [3, 12, 13, 18, 34, 35]. To emphasize the difficulty of this task, it has been observed that even humans struggle to reliably distinguish between certain classes of emotion, especially for speakers of other nationalities [8].

Fig. 1. Sub-clip classification boosting can be compared to an emotion search over the length of the clip, which overcomes the fuzzy transition boundaries between emotions and neutral states

In addition, the location of emotion cues varies both across and within voice clips. In clips with different content, cues naturally occur at different locations. However, even for clips with the same textual content, cue locations vary due to stark differences in how individuals express emotion [2].

Limitations of the State-of-the-Art. Emotion detection is an active area of research with studies ranging from text analysis [36] to facial pattern recognition [11]. Sentiment analysis from text suffers from problems specific to each language, even though humans share a set of universal emotions [1]. In addition, even when controlling for differences in language, emotion expression is highly non-verbal [15]. Thus, methods that consider facial expressions in addition to voice are at times utilized [11]. However, such multi-modal data types are in many cases either too obtrusive, too costly, or simply not available.

To avoid the above pitfalls, our work concentrates on emotion detection from non-verbal voice features. In this context, despite the existence of universal emotions [6], state-of-the-art methods for emotion recognition still do not generalize well between benchmark datasets [7] (Table 3). Some approaches [16, 17] attempt to address this generalization problem by eliminating clip outliers, and thus fail to classify all data instances.

Simple multi-class regression methods have been widely used but fail to achieve practical accuracy [25, 31]. Core work in emotion detection from voice uses a mixture of GMMs (Gaussian Mixture Models) [14] and HMMs (Hidden Markov Models) [5]. Most recently, with the increased popularity of deep learning (DL), we see attempts to apply DL to this problem, in particular Random Deep Belief Networks [35], Recurrent Neural Networks (RNN) [13], and Convolutional Neural Networks (CNN) [3, 12, 34]. CNNs achieve better results than other state-of-the-art methods, but still fail to generalize to all benchmark datasets and leave room for improvement. Due to these shortcomings, more work is needed in voice-based emotion recognition.

Our Proposed Approach. To solve this problem, we set out to design a novel Sub-clip Classification Boosting methodology (SCB). SCB is guided by our hypothesis that emotion tends not to be sustained over the entire length of the clip – not even for clips that have been carefully created for a benchmark. Therefore, the prediction of the emotion state of a clip needs to carefully consider its sub-clips. Sub-clipping (Sect. 5.1) effectively works as a moving lens (Fig. 1) over the sound signal - providing the first step to solving the undetermined boundary problem observed in a prior survey [2].

Next, the high-dimensional non-verbal features extracted from these voice sub-clips are tamed by prioritized feature selection [21], reducing the data dimensionality from more than 1500 features to a few hundred. As we will demonstrate, a dataset-independent feature selection approach guided by the emotion classification problem itself achieves better generalization across diverse benchmarks and emotion classes [7], one of the challenges observed with the current state of the art. In addition, utilizing non-verbal features overcomes classification challenges due to different languages.

To account for the unequal distribution of emotion cues between sub-clips, an Emotion Strength (ES) score is assigned to each sub-clip classification (SC), forming an (SC, ES) pair. Finally, inspired in part by FilterBoost [4], SCB implements an “Oracle” that utilizes all the (SC, ES) pairs of a voice clip to determine the parent clip classification.

To evaluate SCB, we have conducted an extensive study using three popular and widely used benchmark datasets described in Sect. 3. The results demonstrate that, compared to state-of-the-art algorithms, SCB consistently boosts the accuracy of predictions on all these benchmark datasets. For repeatability, we have posted our SCB code base, associated data, and meta-data. To summarize, our main contributions include:

  • We design the SCB framework for the robust detection of emotion classes from voice across speakers and languages.

  • We design an Oracle-based algorithm inspired by FilterBoost that utilizes sub-classification (SC) and Emotion Strength (ES) pairs to classify the parent clip.

  • We tune our models and explore the relationships between sub-clip properties, such as length and overlap, and classification accuracy.

  • Lastly, we evaluate the resulting SCB framework using 3 benchmark datasets and demonstrate that SCB consistently achieves classification accuracy improvements over all state-of-the-art algorithms.

The rest of the paper is organized as follows. Section 2 situates our work in the context of Affective Computing, current methodologies for Emotion Detection, and Boosting Strategies. Section 3 introduces preliminary notation, and our problem definition. Section 4 introduces the Sub-clip Classification Boosting (SCB) Framework. Section 5 explains the SCB method and its sub-components. Section 6 describes our evaluation methodology and experimental results on benchmarks, while Sect. 7 concludes our work.

2 Related Work

Affective Computing. Affective computing [22] refers to computing that relates to, arises from, or deliberately influences emotion. Given that emotion is fundamental to human experience, it influences everyday tasks from learning and communication to rational decision-making. Affective computing research thus develops technologies that consider the role of emotion in addressing human needs. This promises to enable a large array of services, such as better customer experience [30], intelligent tutoring agents [19], and better health care services [24]. Recent related research in psychology [27] points to the existence of universal basic emotions shared throughout the world. In particular, it has been shown that at least six emotions are universal [6], namely: Disgust, Sadness, Happiness, Fear, Anger, and Surprise.

Emotion Detection from Voice. Detection of universal emotions from voice continues to be a challenging problem. Current state-of-the-art methods for emotion recognition do not generalize well from one benchmark database [7] to the next, as summarized in Table 3.

Some approaches [16, 17] attempt to address the generalization problem by eliminating clip outliers belonging to particular emotion classes, thus providing an incomplete solution.

SVMs (Support Vector Machines) have been identified as an effective method, especially when working with smaller datasets [31]. However, their accuracy remains too low for practical purposes (Table 3). Softmax regression [25] is similar to SVM in that it is a multi-class regression that determines the probability of each emotion class. The resulting Softmax accuracy is higher than that of SVM, but still lower than the more recent state-of-the-art methods described below.

GMM (Gaussian Mixture Models) [14] are better at identifying individual differences through the discovery of sub-populations using normally distributed features.

HMM (Hidden Markov Models) [5] are a popular method utilized for voice processing, that relies on building up a probabilistic prediction from very small sub-clips of a voice clip. In contrast to SCB, the sub-clips considered by HMM are too small to contain information that can be accurately classified on its own.

More recently, with the increase in popularity of Deep Learning, Random Deep Belief Networks [35], Recurrent Neural Networks (RNN) [13], and Convolutional Neural Networks (CNN) [3, 12, 34] have been used for the emotion classification task. RNNs [13] and Random Deep Belief Networks [35] have been shown to have low accuracy for this task (see also our Table 3). CNNs [3, 12, 34], on the other hand, have achieved some of the highest prediction accuracies for specific benchmark datasets. However, this method fails to generalize to other sets [3, 12, 34]. This drawback may stem in part from the lack of large available datasets with labeled human emotions, which are hard to obtain given the unique nature of this task. As our in-depth analysis reveals, the accuracy of CNNs is comparable to the previously discussed and less complex methods, while still leaving room for improvement.

Boosting Strategies. SCB is a meta-model approach, in the same family as Boosting, Bagging, and Stacking [26]; more specifically, it is a type of boosting. Boosting methods have been shown to be effective in increasing the accuracy of machine learning models. Unlike other boosting methods that work with subsets of features or data, SCB utilizes sub-instances (sub-clips) to boost the classification accuracy. Results of a direct comparison with current meta-models are given in Table 2.

Table 1. Preliminary notation

3 Preliminaries

We work with three benchmark datasets for emotion detection on voice clips, namely: the RML Emotion Database [33], an emotion dataset containing 710 audio-visual clips by 8 actors in 6 languages, displaying the 6 universal [6] emotions Disgust, Sadness, Happiness, Fear, Anger, and Surprise; the Berlin Emotion Database [32], a German emotion dataset containing 800 audio clips by 10 actors (5 male, 5 female), displaying the 6 universal emotions and Neutral; and SAVEE [10], an English emotion dataset containing 480 audio-visual clips by 4 subjects, displaying the 6 universal emotions and Neutral. Table 1 highlights the key notation used in the paper.

Problem Definition. Given a set of voice clips \(\varvec{v}^{\prime }_i \in \mathcal {V}\) with unknown emotion states, the problem is to predict the class \(e^{\prime }_i\) for each instance \(\varvec{v}^{\prime }_i\) with high accuracy.

4 The SCB Framework

To tackle the above emotion detection problem, we introduce the SCB framework (Sub-clip Classification Boosting) as shown in Fig. 2. Below we describe the main processes of the SCB framework.

Fig. 2. SCB framework

Sub-clip Generation module takes as input a voice clip, the desired number of sub-clips N, and a percent overlap parameter \(\varOmega \), to generate N sub-clips of equal length. During model training sub-clips are labeled after their parent clip. This process is further described in Sect. 5.1.

Feature Extraction is performed using openSMILE [9], which generates a set of over 1500 features. Due to the high dimensionality of the data generated by openSMILE, a feature selection mechanism is applied during model training to reduce the complexity of further steps, allowing us to instead focus on the core classification technology.
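
As a rough illustration, and assuming the opensmile Python wrapper is available, a functional-level extraction for one clip might look as follows. The ComParE 2016 feature set and the file path are placeholders, not the exact configuration used in this work, which is only specified as producing over 1500 features.

```python
import opensmile

# Configure functional-level extraction; ComParE_2016 is only a stand-in for
# the (unspecified) openSMILE configuration that yields the paper's 1500+ features.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# "subclip.wav" is a placeholder path to one generated sub-clip.
features = smile.process_file("subclip.wav")  # one row of functionals per (sub-)clip
print(features.shape)
```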

Feature Selection is applied during model training. For this task, we explored several strategies for dimension selection, all equally applicable. In our experiments, we work with one found to be particularly effective in our context, namely maximizing the mean decrease in node impurity based on the Gini index for Random Forests [21]. Based on this metric, a vector with the importance of each attribute is produced. Guided by this importance criterion, the system selects a set of top features. For the purposes of this research, we select the top 350 features, a more than four-fold reduction from the original feature set. Our approach is general, however, and alternate feature selection strategies could equally be plugged into this framework.
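
A minimal sketch of this selection step, using scikit-learn's random forest importances as a stand-in for the Gini-based mean-decrease-in-impurity criterion [21] (the paper does not specify a particular implementation), could look as follows. The array names and the forest size are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_top_features(X, y, k=350, seed=0):
    """Rank features by the forest's mean decrease in Gini impurity and keep the top k."""
    forest = RandomForestClassifier(n_estimators=500, random_state=seed)
    forest.fit(X, y)
    # feature_importances_ is the normalized mean decrease in impurity (Gini)
    ranking = np.argsort(forest.feature_importances_)[::-1]
    return ranking[:k]

# Usage (hypothetical arrays): selected = select_top_features(X_train, y_train)
# X_train_reduced = X_train[:, selected]   # cached feature list for deployment
```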

Cached Feature Selection (On-line). During model deployment, the cached list of features selected during training is looked up to reduce the complexity of Feature Extraction from new clips.

Sub-clip Classification module trains sub-classification models to assign (universal and neutral) emotion labels to future sub-clips. Several ML algorithms are evaluated for this task (Sect. 5.2).

Sub-clip Emotion Strength Scores are computed for each sub-classification. To reduce bias, SCB utilizes a non-linear ensemble of multiple regression models that generate scores between 0 and 1, where a score close to 0 indicates low Emotion Strength or a weak sub-classification, while a score closer to 1 indicates high Emotion Strength or a strong sub-classification (Sect. 5.2).

Clip Classification “Oracle” evaluates the (SC, ES) pairs to classify the parent clip. Several “Oracle” implementations are evaluated, namely, Cumulative Emotion Strength, Maximum Emotion Strength, and Sub-clip Voting. These are discussed in more detail in Sect. 5.3.

5 Sub-clip Classification Boosting

Our proposed Sub-clip Classification Boosting (SCB) method (Algorithm 1) is based on the intuition that strong emotion cues are concentrated unevenly within a voice clip. Thus, our algorithm works as a moving lens to find sub-clips of a data instance that can be used to boost the classification accuracy of the whole. Each sub-clip is independently classified, and for each sub-classification (SC) an Emotion Strength score (ES) is also predicted. To combine the (SC, ES) pairs into the parent clip classification, SCB implements a FilterBoost-inspired [4] “Oracle” (Sect. 5.3).

Fig. 3. Overview of SCB. Each voice clip is assigned a sub-classification, emotion strength pair. An “Oracle” uses these pairs to classify the parent clip

A high-level description of the SCB algorithm is given in Fig. 3, while the actual method is summarized in Algorithm 1, which takes as inputs a voice clip \(\varvec{v}^{\prime }_i\), the number of sub-clips N, the sub-clip overlap \(\varOmega \), and a pre-selected set of features \(\mathcal {F}\). The length of the sub-clips is then determined in Line 1. Next, the parent voice clip \(\varvec{v}^{\prime }_i\) is cut into sub-clips (Line 6), which are processed one at a time. Then, openSMILE (Line 9) extracts the preselected set of features \(\mathcal {F}\), as described in Sect. 4, from each individual sub-clip. These features are stored as a vector \(\varvec{f}^{\prime N,\varOmega }_{i,j}\). In Line 10, a previously trained sub-classification model C (Sect. 5.2) labels the sub-clip based on the feature vector \(\varvec{f}^{\prime N,\varOmega }_{i,j}\). In Line 12, the ES score is computed by a regression model R (Sect. 5.2). Thus, each sub-clip receives an (SC, ES) pair of a label and a score. In the example in Fig. 3, the pairs are (Happy, 0.9), (Neutral, 0.5), and (Sad, 0.1). SC labels and ES scores are saved in the \(\mathcal {\hat{E}}^{\prime N,\varOmega }_i\) and \(\mathcal {\hat{W}}^{\prime N,\varOmega }_i\) ordered sets, respectively.

Finally, the Classification “Oracle” evaluates the \(\mathcal {\hat{E}}^{\prime N,\varOmega }_i\) and \(\mathcal {\hat{W}}^{\prime N,\varOmega }_i\) ordered sets and outputs the parent clip classification (Line 16), as described in Sect. 5.3.

Algorithm 1. The SCB method
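
For concreteness, the following Python sketch mirrors the flow of Algorithm 1 as described above. The helper extract_features, the trained sub-classifier C, the per-emotion ES regressors R, and the oracle rule are hypothetical placeholders for the components of Sects. 4, 5.2, and 5.3, not the actual implementation.

```python
# High-level sketch of Algorithm 1 (SCB); all helpers and models are placeholders.
def scb_classify(clip, N, omega, features, C, R, oracle, extract_features):
    sub_len = len(clip) / (N - (N - 1) * omega)          # Line 1: Eq. (1)
    labels, strengths = [], []
    for j in range(N):
        start = int(j * sub_len * (1 - omega))           # Line 6: cut the j-th sub-clip
        sub = clip[start:start + int(sub_len)]
        f = extract_features(sub, features)              # Line 9: openSMILE, cached features only
        e_hat = C.predict([f])[0]                        # Line 10: sub-classification (SC)
        w_hat = R[e_hat].predict([f])[0]                 # Line 12: emotion strength (ES)
        labels.append(e_hat)
        strengths.append(w_hat)
    return oracle(labels, strengths)                     # Line 16: parent clip label
```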

5.1 Sub-clip Generation

To generate sub-clips, each voice clip \(\varvec{v}_i\) is split into N sub-clips \(\varvec{v}^{N,\varOmega }_{i,j}\) with \(\varOmega \) overlap, where \(0\le \varOmega <1\) is relative to the length of the sub-clips. The sub-clip length \(|\varvec{v}^{N,\varOmega }_{i,j}|\) is computed by Eq. 1. Sub-clips are cut from their parent clip as described in Algorithm 1 (Lines 1–8) and illustrated in Fig. 3.

$$\begin{aligned} |\varvec{v}^{N,\varOmega }_{i,j}| = \frac{|\varvec{v}_i|}{(N-(N-1) \times \varOmega )}. \end{aligned}$$
(1)
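
As a quick sanity check of Eq. (1), the sketch below computes the sub-clip boundaries for a 6-second clip with \(N=3\) and \(\varOmega =0.5\), assuming consecutive sub-clips are shifted by \((1-\varOmega )\) of a sub-clip length so that the N sub-clips exactly tile the parent clip.

```python
# Worked example of Eq. (1): 6-second clip, N = 3 sub-clips, 50% overlap.
clip_len, N, omega = 6.0, 3, 0.5
sub_len = clip_len / (N - (N - 1) * omega)               # 6 / (3 - 1) = 3.0 s
starts = [j * sub_len * (1 - omega) for j in range(N)]   # [0.0, 1.5, 3.0]
bounds = [(s, s + sub_len) for s in starts]              # [(0.0, 3.0), (1.5, 4.5), (3.0, 6.0)]
print(sub_len, bounds)                                   # sub-clips exactly cover the parent clip
```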

Tuning the number of sub-clips and their overlap has interesting implications for the accuracy of the parent clip classification (Sects. 6.5.2 and 6.5.1). Increasing the number of sub-clips increases the number of independent classifications per clip, but it also reduces the amount of information per sub-clip. To mitigate this information reduction, sub-clip overlap adds additional, but redundant, information to each sub-clip. This redundancy needs to be controlled during model training and testing in order to avoid over-fitting. Our results show that a certain amount of overlap increases parent-clip classification accuracy. By tuning these parameters we discovered optimal zones for both N and \(\varOmega \), making these hyperparameters important to model tuning.

5.2 Sub-classification and Emotion Strength Pairs

Sub-classification (SC). A sub-clip classification Model C is trained using labeled feature vectors of audio sub-clips, so that it can predict the label of a new instance represented by a feature vector \(\varvec{f}^{\prime N,\varOmega }_{i,j}\). Therefore, this process can be expressed as \(\hat{e}^{\prime N,\varOmega }_{i,j} = C(\varvec{f}^{\prime N,\varOmega }_{i,j})\), where \(\hat{e}^{\prime N,\varOmega }_{i,j}\) is the SC label of the sub-clip \(\varvec{v}^{\prime N,\varOmega }_{i,j}\). To implement these models, we evaluate several machine learning algorithms suitable for classification, such as SVM, RandomForest, kNN, and naiveBayes.
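
A minimal sketch of training C with one of the evaluated algorithms (SVM, which Sect. 6.4 later identifies as the best performer) is given below. The array names X_sub, y_sub, and f_new are hypothetical, and the feature scaling step is added as common practice rather than taken from the paper.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X_sub: (n_subclips, 350) selected feature vectors; y_sub: labels inherited
# from the parent clips (hypothetical array names).
C = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
C.fit(X_sub, y_sub)

e_hat = C.predict(f_new.reshape(1, -1))[0]   # SC label for a new sub-clip vector f_new
```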

Emotion Strength (ES). Given a predicted SC label \(\hat{e}^{\prime N,\varOmega }_{i,j}\), the ES score \(0\le \hat{w}^{\prime N,\varOmega }_{i,j}\le 1\) of the sub-clip \(\varvec{v}^{\prime N,\varOmega }_{i,j}\) is predicted by a previously trained regression model such that \(\hat{w}^{\prime N,\varOmega }_{i,j}= R(\varvec{f}^{\prime N,\varOmega }_{i,j} | \hat{e}^{\prime N,\varOmega }_{i,j})\). Since SC labels are unknown prior to sub-clip classification, separate ES regression models need to be trained for each of the 7 emotions (6 universal emotions and Neutral) that are present in the datasets. During training, categorical labels are converted to binary values, with 1 indicating the presence of the emotion for which the regression model is being trained to generate an ES score, and 0 for all other emotions. Several machine learning algorithms are tested for this task, such as SVM, KNN, ANN, and Linear Models. A non-linear ensemble of models such as \(\hat{w}^{\prime N,\varOmega }_{i,j}= R_{SVM} (\varvec{f}^{\prime N,\varOmega }_{i,j} | \hat{e}^{\prime N,\varOmega }_{i,j}) \times R_{kNN} (\varvec{f}^{\prime N,\varOmega }_{i,j} | \hat{e}^{\prime N,\varOmega }_{i,j})\) was found optimal for reducing bias. Intuitively, ES scores closer to 1 are viewed as strong support for the SC label of the (SC, ES) pair, while low ES scores indicate low support for the pair’s SC label.
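
The following sketch illustrates the per-emotion ES regressors and the \(R_{SVM} \times R_{kNN}\) ensemble under these definitions. The function and array names are hypothetical, default scikit-learn hyperparameters are used for brevity, and the final clipping to [0, 1] is an added safeguard rather than part of the described method.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

EMOTIONS = ["Disgust", "Sadness", "Happiness", "Fear", "Anger", "Surprise", "Neutral"]

def train_es_models(X_sub, y_sub):
    """One (SVM, kNN) regressor pair per emotion, trained on binarized labels."""
    models = {}
    for e in EMOTIONS:
        target = (np.asarray(y_sub) == e).astype(float)   # 1 for emotion e, 0 otherwise
        models[e] = (SVR().fit(X_sub, target),
                     KNeighborsRegressor().fit(X_sub, target))
    return models

def emotion_strength(models, e_hat, f):
    """Non-linear ensemble R_SVM * R_kNN for the predicted SC label e_hat."""
    svm_r, knn_r = models[e_hat]
    score = svm_r.predict([f])[0] * knn_r.predict([f])[0]
    return float(np.clip(score, 0.0, 1.0))   # clipping to [0, 1] is an added safeguard
```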

5.3 Clip Classification “Oracle”

We evaluate three “Oracle” implementations for the classification of the parent clips from the (SC, ES) pairs, namely, Cumulative Emotion Strength Sub-clip Classification, Maximum Sub-clip Emotion Strength Classification, and Sub-clip Voting Classification. In each of the implementations, the “Oracle” receives the \(\mathcal {\hat{E}}^{\prime N,\varOmega }_i\) and \(\mathcal {\hat{W}}^{\prime N,\varOmega }_i\) ordered sets containing SC labels and ES scores, and makes a decision on the classification of the parent clip as described below.

5.3.1 Cumulative Emotion Strength

(Equation 2) is an “Oracle” implementation based on the intuition that all sub-clips have a weight in the final clip classification. For example, if a parent clip is divided into 3 sub-clips with the (SC, ES) pairs (Fear, 0.7), (Happiness, 0.4), (Happiness, 0.6), the emotion with the highest cumulative ES \((0.4+0.6=1)\) is Happiness.

$$\begin{aligned} \hat{e}^{\prime }_i = \mathop {\text {argmax}}\limits _{e \in \mathcal {E}} \sum \limits _{\forall j \mid \hat{e}^{\prime N,\varOmega }_{i,j} = e}\hat{w}^{\prime N,\varOmega }_{i,j}. \end{aligned}$$
(2)

ES scores play no role in sub-classification and are only used during the parent clip classification. In case of a tie, one of the tied labels is chosen at random.

5.3.2 Maximum Emotion Strength

(Equation 3) is an “Oracle” implementation based on the intuition that certain sub-clips best represent the overall emotion of the parent clip. Therefore, this method promotes the sub-classification with the highest ES score. To illustrate, if a target clip is divided into 3 sub-clips with the (SC, ES) pairs (Fear, 0.7), (Happiness, 0.4), (Happiness, 0.6), the parent clip would be classified as Fear based on the maximum emotion strength. In case of a tie, one of the tied labels is chosen at random.

$$\begin{aligned} \hat{e}^{\prime }_i = \mathop {\text {argmax}}\limits _{e \in \mathcal {E}} (\text {max}(\hat{w}^{\prime N,\varOmega }_{i,j} \mid \hat{e}^{\prime N,\varOmega }_{i,j} = e)). \end{aligned}$$
(3)

5.3.3 Sub-clip Voting Classification

(Equation 4) is an “Oracle” implementation based on the intuition that all sub-classifications contribute equally to the parent classification, in essence a voting ensemble. To illustrate, if a parent clip is divided into 5 sub-clips, of which three are classified as Fear, one as Happiness, and one as Anger, the clip will be classified as Fear. In case of a tie, one of the tied labels is chosen at random.

$$\begin{aligned} \hat{e}^{\prime }_i = \mathop {\text {argmax}}\limits _{e \in \mathcal {E}} (\text {count}(\hat{e}^{\prime N,\varOmega }_{i,j} = e)). \end{aligned}$$
(4)
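
To make the three rules concrete, the sketch below applies Eqs. (2) through (4) to the running example of (SC, ES) pairs used above. Unlike the described method, ties here fall back to Python's deterministic max/Counter ordering rather than a random choice.

```python
from collections import Counter, defaultdict

def cumulative_strength(labels, strengths):    # Eq. (2)
    totals = defaultdict(float)
    for e, w in zip(labels, strengths):
        totals[e] += w
    return max(totals, key=totals.get)

def maximum_strength(labels, strengths):       # Eq. (3)
    j = max(range(len(labels)), key=lambda j: strengths[j])
    return labels[j]

def subclip_voting(labels, strengths):         # Eq. (4); ES scores are ignored
    return Counter(labels).most_common(1)[0][0]

labels, strengths = ["Fear", "Happiness", "Happiness"], [0.7, 0.4, 0.6]
print(cumulative_strength(labels, strengths))  # Happiness (0.4 + 0.6 = 1.0 > 0.7)
print(maximum_strength(labels, strengths))     # Fear (single strongest sub-clip)
print(subclip_voting(labels, strengths))       # Happiness (two votes out of three)
```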

6 Experimental Evaluation

To evaluate the performance of our SCB framework we conducted comprehensive experiments using three benchmark datasets that are commonly used to compare state-of-the-art methods. The experimental results show that SCB can achieve up to 4% improvement in classification accuracy over current state-of-the-art methods, while consistently achieving better generalization between datasets.

6.1 Benchmark Data Sets

SCB was evaluated with three popular benchmark datasets for emotion detection on voice clips, described in Sect. 3. These datasets are chosen for their complete coverage of the six universal emotions [6], namely: Disgust, Sadness, Happiness, Fear, Anger, and Surprise. In addition, SAVEE and the Berlin Emotion Database contain voice clips labeled as Neutral. These datasets also provide clips in different languages and from diverse actors. Further, they are widely used by state-of-the-art methods and thus provide solid baseline evaluations.

6.2 Comparison of Alternate Methods

To validate SCB, we utilized a two-fold comparison approach. First, we compare against common baseline methods deployed within the SCB framework as replacements for the SCB method (Algorithm 1). For this task, we compare against common classification algorithms such as kNN, SVM, and randomForest, as well as boosting and ensemble methods (summarized in Table 2).

Second, we compare against state-of-the-art methods that classify emotion from the same benchmark datasets, namely, Hidden Markov Model [31], Gaussian Mixture Model [37], Orthogonal and Mahalanobis Distance Based [17], Random Deep Belief Networks [35], Simple SVM [29], Adaptive Boosting [28], Softmax Regression [25], CNN [3, 12, 34], and RNN [13]. These methods are briefly described in Sect. 2. Results of comparisons with state-of-the-art methods are summarized in Table 3.

6.3 Evaluation Settings

To directly compare to state-of-the-art methods, we utilize parent clip classification accuracy, defined as the ratio of True Clip Classifications (TC) to the total number of classifications, which includes False Classifications (FC): \(\frac{TC}{TC+FC}\).

For consistency, all models are trained and validated using an n-fold cross-validation approach with \(n=5\). It is important to note that when \(\varOmega > 0\), adjacent sub-clips overlap and thus share common information. If not controlled, this overlap causes partial over-fitting and artificially higher model accuracy. Therefore, we operate under the constraint that sub-clips of the same clip are kept in one fold during model training and testing, thus avoiding any over-fitting.
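
One way to realize this constraint, sketched below, is a group-aware split such as scikit-learn's GroupKFold, where every sub-clip carries its parent clip's id as a group label; the array names are hypothetical and the paper does not prescribe this particular utility.

```python
from sklearn.model_selection import GroupKFold

# X_sub, y_sub: sub-clip feature vectors and labels; clip_ids[j] identifies the
# parent clip of sub-clip j (hypothetical array names).
gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X_sub, y_sub, groups=clip_ids):
    # Train C and the ES regressors on train_idx only; overlapping sub-clips of
    # one parent clip can never appear on both sides of the split.
    ...
```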

6.4 Evaluating “Oracle” Implementations

The different “Oracle” implementations discussed in Sect. 5.3 are evaluated for their relative classification accuracy. The results are illustrated in Fig. 4. Overall, classification based on the sub-clip Maximum Emotion Strength yields the best and most consistent results. This finding supports our hypothesis that emotion cues are localized. In addition, it was also observed that SVM was the best performing sub-classification algorithm. Results are further improved after hyper parameter tuning in Sect. 6.5.

Fig. 4. Comparing “Oracle” implementations using different sub-classification algorithms. The maximum emotion strength “Oracle” consistently beats the other implementations. SVM is the overall winner as the choice of sub-classification algorithm

Fig. 5. Classification accuracy when varying overlap \(\varOmega \) and number of clips N. The charts indicate the existence of optimal values for \(\varOmega \) and N across datasets

6.5 Hyper Parameter Tuning

6.5.1 Sub-clip Overlap

In the examples in Figs. 5a through e, the overlap \(\varOmega \) displays optimal ranges that present tuning opportunities. Further, the optimal range for \(\varOmega \) interacts with N, as seen for \(N=3\) (Fig. 5a), \(N=5\) (Fig. 5b), and \(N=7\) (Fig. 5c, EmoDb). These findings are consistent across datasets, as also seen for the Berlin EmoDB for \(N=3\) (Fig. 5d) and \(N=5\) (Fig. 5e).

6.5.2 Number of Sub-clips

One example of tuning for the optimal number of sub-clips is given in Fig. 5f. In this example, parent clip classification accuracy peaks for \(5 \le N \le 7\) with \(\varOmega = 0.4\). This is consistent with the intuition that more sub-clips increase classification opportunities for the parent clip but decrease the amount of information in each sub-clip, so a trade-off is expected. In addition, the amount of information in each sub-clip also depends on the amount of overlap \(\varOmega \), the optimization of which is explored in Sect. 6.5.1.

Table 2. Comparison of SCB to baseline methods. When used as baselines, classification algorithms (kNN, RF, SVM, and Naive Bayes) directly classify the parent clip using its feature vector. In the context of AdaBoost, multiple models using the same algorithm are created from feature subsets to boost the model accuracy. In Stacking, multiple models from different algorithms are combined. In the context of SCB, classification algorithms are used to generate sub-classification (SC) labels

6.6 Overall Results

In the context of emotion detection, SCB is superior to the baseline (Table 2) and state-of-the-art algorithms (Table 3) for all three of the evaluated datasets. We also note that some of the state-of-the-art methods are not directly comparable. In the case of [17], the method eliminates the 5% of the data that is labeled as outliers. Similarly, the accuracy for [25] was adjusted based on the number of instances per class. Coincidentally, in the RML EmoDB dataset, Happiness had the highest classification error rate but a lower number of instances; in that case, applying the approach of [25] would artificially increase accuracy. However, when applied to the SAVEE dataset, which has a balanced number of instances, the same method failed to produce similar accuracy improvements. Therefore, in comparable settings, SCB consistently outperforms all state-of-the-art algorithms and baselines.

7 Conclusion and Future Work

In this study we introduced Sub-clip Classification Boosting (SCB), a novel boosting method based on sub-clip classification. Further, we discovered interesting trade-off patterns related to the number of sub-clips and the sub-clip overlap. We also introduced and evaluated several “Oracle” implementations to determine the parent clip classification from its sub-clips. Lastly, we evaluated our SCB framework using 3 benchmark datasets and achieved significant parent clip classification accuracy improvements over state-of-the-art algorithms, as well as consistent generalization across datasets.

Table 3. Accuracy evaluation of the SCB framework. \(^*\)Method eliminates 5% of the data; accuracy not directly comparable. \(^{**}\)Accuracy was adjusted based on the number of instances per class, controlling for speaker, and is therefore not comparable. \(^{***}\)Across speakers, accuracy varied from 0.84 to 0.845 for the Berlin EMODb. One study [12] used the SAVEE dataset in addition to the Berlin EMODb