
1 Introduction

Vocal emotion recognition now has applications in domains as diverse as medicine, telecommunications and transport [1]. In telecommunications, for example, it would become possible to prioritize calls from individuals in imminent danger over less urgent ones. More generally, emotion recognition improves human/machine interfaces, which explains the rapid growth of research in this field, driven by recent progress in machine learning.

Human interactions rely on multiple sources: body language, facial expressions, etc. A vocal message carries a lot of information that we interpret implicitly. This information can be expressed or perceived verbally, but also non-verbally, through the tone, volume or speed of the voice. The automatic analysis of such information gives insights into the speaker's emotional state.

The conceptualization of emotions is still a hot topic in psychology, and opinions do not converge towards a single model. Three main approaches can be distinguished [9]: (1) the basic emotions (Anger, Disgust, Fear, Happiness, Sadness, Surprise) described by Ekman [6], (2) the circumplex model of affect and (3) the appraisal theory. In the second model, the affective state is described by at least two dimensions: the valence, which measures the positivity of the emotion, and the arousal, which measures its activity [18, 23]. These two values, bounded to [−1, +1], describe the emotional state of an individual much more precisely than the basic emotions. However, it has been shown that other dimensions are necessary to describe this state more accurately during an interaction [8].

The choice of one model or the other constrains the kind of machine learning algorithms used to estimate the emotional state. With basic emotions, the variable to be predicted is qualitative and nominal, so classification methods must be used. Affective dimensions, on the contrary, are quantitative, continuous and bounded variables, which call for regression methods. To take advantage of the best of both worlds, we propose in this study a method that combines classification and regression. To predict a continuous and bounded variable, we first quantize the affect variable into bounded ranges. For example, a five-range valence quantization would give the boundaries {−1, −0.5, −0.2, +0.2, +0.5, +1}, which can be interpreted as “very negative”, “negative”, “neutral”, “positive” and “very positive”. We then proceed in three steps (a short code sketch of the quantization follows the list):

  • Train an ensemble of classifiers to estimate whether the affect variable associated with an observation is higher than a given boundary;

  • Combine the ensemble decisions to predict the optimal range;

  • Locally regress the affect variable on this range.
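As a minimal illustration of the quantization step above, the following sketch maps a continuous valence value to its range index with the five-range boundaries of the example (the helper name and the use of NumPy are our own):

```python
import numpy as np

# Boundaries of the five-range valence quantization described above.
boundaries = np.array([-1.0, -0.5, -0.2, 0.2, 0.5, 1.0])

def quantize(y, boundaries):
    """Map a continuous affect value y in [-1, +1] to its range index."""
    # searchsorted returns the index of the first boundary strictly
    # greater than y; subtracting 1 yields the range [b_i, b_{i+1}[.
    idx = np.searchsorted(boundaries, y, side="right") - 1
    # Clip so that y = +1 falls in the last range instead of beyond it.
    return int(np.clip(idx, 0, len(boundaries) - 2))

print(quantize(0.15, boundaries))  # -> 2, i.e. the "neutral" range
```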

The proposed method is therefore a cascade of ordinal classifiers and local regressors (COCLR). As the state of the art below shows, similar proposals have been made. In this paper, however, we perform a thorough study of the key parameter of this method: the number of ranges to be separated by the ensemble of ordinal classifiers. We show experimentally that:

  • On small and numerous ranges, ordinal classification performs well;

  • On large ranges, the COCLR cascade performs better;

  • On challenging databases (AVEC’2014 [23] and AV+EC’2015 [17], described in Sect. 4), the COCLR cascade compares favorably with the challengers’ and winners’ proposals, at an acceptable development and computational cost.

This paper is organized as follows. Section 2 reviews the state of the art on affect prediction from audio data. In Sect. 3, we present the COCLR flowchart. In Sect. 4, we introduce the datasets used to train and evaluate our system and the pre-processing applied. Then, in Sect. 5, we present and discuss our results. Finally, Sect. 6 offers some conclusions.

2 State of the Art

The Audio-Visual Emotion recognition Challenge (AVEC), held every year since 2011, makes it possible to assess the proposed systems on similar datasets. The main objective of these challenges is to ensure a fair comparison between research teams by using the same data. In particular, the unlabeled test set is released to registered participants a few days before the challenge deadline. Moreover, the organizers provide competitors with a set of audio and video descriptors extracted by approved methods.

Prosodic features such as pitch, intensity, speech rate and voice quality are important for identifying the different types of emotions. Low-level acoustic descriptors such as energy, spectrum, cepstral coefficients, formants, etc. enable an accurate description of the signal [23]. Furthermore, it has recently been shown that the features learned by the first layers of deep convolutional networks are quite similar to some of these acoustic descriptors [22].

2.1 Emotion Classification and Prediction

The classification of emotion is done through classical methods like support vector machines (SVM) [2], Gaussian mixture models (GMM) [20] or random forests (RF) [15]. For regression tasks, numerous models have been proposed: support vector regressors (SVR) [5], deep belief networks (DBN) [13], bidirectional long-short term memory networks (BLSTM) [14], etc. As all these models have their own pros and cons, recent works focus on model combinations to improve overall accuracy. Thus, in [11], the authors propose to combine a BLSTM and an SVR, to benefit from the BLSTM's handling of past/present context and the generalization ability of the SVR.

The AV+EC’2015 challenge winners proposed in [12] a hierarchy of BLSTMs. They deal with 4 information channels: audio, video (described by frame-by-frame geometric features and temporal appearance features), electrocardiogram and electrodermal activity. They combine the predictions of single-modal deep BLSTMs with a multimodal deep BLSTM that performs the final affect prediction.

2.2 Ordinal Classification and Hierarchical Prediction

The standard approach to ordinal classification converts the class value into a numeric quantity and applies a regression learner to the transformed data, translating the output back into a discrete class value in a post-processing step [7]. Here, we work directly on the numerical values of the affect variables but quantize them into several ranges. Recently, a discrete classification of continuous affective variables through generative adversarial networks (GAN) has been proposed [3]; five ranges are considered.
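For illustration only, a hedged sketch of this standard regression-based scheme (our own simplification, not the exact procedure of [7]): the ordered classes are encoded as integers, any regressor is fitted, and predictions are rounded back to the nearest class.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class RegressionOrdinalClassifier:
    """Ordinal classification via regression: encode the K ordered
    classes as 0..K-1, regress, then round back to the nearest class."""

    def __init__(self, n_classes, regressor=None):
        self.n_classes = n_classes
        self.regressor = regressor or DecisionTreeRegressor(max_depth=5)

    def fit(self, X, y_class):
        self.regressor.fit(X, np.asarray(y_class, dtype=float))
        return self

    def predict(self, X):
        raw = self.regressor.predict(X)
        return np.clip(np.rint(raw), 0, self.n_classes - 1).astype(int)
```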

The idea of combining regressors and classifiers has already been applied to age estimation from images. In [10], a first “global” regression is performed with an SVR on all ages; it is then refined by locally adjusting the regressed age value with an SVM. In [21], the authors propose another hierarchy for the same problem. They define 3 age ranges (namely “child”, “teen” and “adult”). An image is classified by combining the results of a pool of classifiers (SVC, FLD, PLS, NN and naïve Bayes) with a majority vote. A second stage then uses the appropriate relevance vector machine regression model (trained on one age range) to estimate the age.

The idea of such a hierarchy is not new, but its application to affect data has not been proposed yet. Moreover, we show in the following experiments that the number of boundaries considered impacts the performance of the whole hierarchy.

3 Cascade of Ordinal Classifiers and Local Regressors

The cascade of ordinal classifiers and local regressors proposed here is a hybrid combination of classification and regression systems. Let us note X the observation (feature vector), y the affective variable to be predicted (valence or arousal) and ŷ the prediction. The variable y is continuous and defined on the bounded interval [−1, +1]. It is therefore possible to segment this interval into a set of n smaller sub-intervals, called “ranges” in the following, delimited by the boundaries b_i, i ∈ {1, …, n + 1}. For example, n = 2 defines 2 ranges, [−1, 0[ (“negative”) and [0, +1] (“positive”), and 3 boundaries b_i ∈ {−1, 0, +1}. Each boundary b_i (except −1 and +1) defines a binary classification problem: given the observation X, decide whether y is lower or higher than b_i. By combining the outputs of the (n − 1) binary classifiers, we get an ordinal classification: given the observation X, the prediction ŷ is probably (necessarily, in case of perfect classification) located within the range [b_i, b_{i+1}[. Once this range is obtained, a local regression is run on it, along with its direct neighbors, to predict y. Figure 1 illustrates the full cascade. The structure of this system is modular and compatible with any kind of classification and regression algorithm. Moreover, it is generic and may be adapted to subjects other than affective dimension prediction.

Fig. 1.

COCLR: a two-stage cascade. The first stage is a combination of binary classifiers whose aim is to estimate y’s range. The observation X is then handled by the corresponding local regressor, which evaluates the value of y on this range and its neighbors.

3.1 Ordinal Classification

The regression of an affect value y from an observation X can be bounded by the minimum and maximum values this variable might take. The interval on which y is defined, I = [min(y), max(y)], can be divided into n ranges.

The first stage of the cascade is an ensemble of (n − 1) binary classifiers. Each classifier decides whether, given the observation X, the variable to be predicted is higher than the lower boundary b_i of a range or not. Training samples are labeled −1 if their y value is lower than b_i and +1 otherwise. Given the ordered nature of the boundaries b_i, we thus build an ensemble of ordinal classifiers [7].

We combine the decisions of these classifiers to compute the lower and upper bounds of the optimal range [b_i, b_{i+1}[. Consider an observation X with y = 0.15, a number of ranges n = 6 and the boundaries b_i ∈ {−1.0, −0.5, −0.25, 0, 0.25, 0.5, 1.0}. In case of perfect classification, the output vector of the ensemble of (n − 1) = 5 classifiers (one per internal boundary) would be {+1, +1, +1, −1, −1}, where −1 means “y is lower than b_i” and +1 means “y is higher than b_i”. The bound b_i is therefore the one associated with the last classifier with a positive output, and b_{i+1} the one associated with the first classifier with a negative output. By combining the local decisions of these binary classifiers, we get the (optimal) range [b_i, b_{i+1}[, here [0, 0.25[. This range C_i will be used in the second stage to locally predict y. However, indecision between two classifiers can happen [16]; it will be handled by the second stage of the cascade.
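A minimal sketch of this combination rule, assuming each of the (n − 1) classifiers outputs ±1 as above (counting positive votes is one simple rule, and it also yields a plausible range when the vote vector is inconsistent; the function name is ours):

```python
import numpy as np

# Boundaries of the worked example: n = 6 ranges, 7 boundaries.
boundaries = np.array([-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0])

def decode_range(votes, boundaries):
    """Combine the (n - 1) binary decisions into a range [b_i, b_{i+1}[.

    votes[k] = +1 means "y is higher than boundaries[k + 1]" (the k-th
    internal boundary), -1 otherwise."""
    i = int(np.sum(np.asarray(votes) > 0))
    return boundaries[i], boundaries[i + 1]

lo, hi = decode_range([+1, +1, +1, -1, -1], boundaries)  # perfect votes, y = 0.15
print(lo, hi)           # -> 0.0 0.25
print(0.5 * (lo + hi))  # center of the range: the discrete estimate used below
```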

The performance measure of the ordinal classifiers, the accuracy, is directly linked to the definition of the ranges. The number of ranges n is a key parameter of our system and can be seen as a hyper-parameter. The n ranges and their corresponding boundaries b_i can be defined in several ways: if they are linearly distributed, they define a kind of affective scale, as in [3], but the boundaries b_i can also be chosen to prevent strong imbalances between classes. In case of highly imbalanced classes, the application of a data augmentation method is strongly recommended [4].
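As an illustration, two boundary-placement strategies (the quantile-based variant is our own example of imbalance-aware placement, not the paper's exact procedure):

```python
import numpy as np

def linear_boundaries(n_ranges, lo=-1.0, hi=1.0):
    """Linearly spaced boundaries: a kind of affective scale, as in [3]."""
    return np.linspace(lo, hi, n_ranges + 1)

def quantile_boundaries(y_train, n_ranges):
    """Quantile-based boundaries: roughly equal class sizes, which
    limits the imbalance discussed above."""
    return np.quantile(y_train, np.linspace(0.0, 1.0, n_ranges + 1))

# Example on synthetic, neutral-heavy valence values:
y_train = np.clip(np.random.default_rng(0).normal(0.0, 0.2, 5000), -1, 1)
print(linear_boundaries(4))
print(quantile_boundaries(y_train, 4))
```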

Given a set of ranges, we can evaluate the accuracy (range detection rate) of the classifier combination. The combination can also be used to compute a discrete estimate of y, by taking the center of the predicted range as the value of ŷ. Finally, we can estimate the correlation of ŷ with the ground truth y.

3.2 Local Regression

The aim of the second stage of the cascade is to compute the continuous value of y. Each range i is therefore associated with a regressor R_i that locally regresses y on [b_i, b_{i+1}]; each regressor is thus specialized in the regression on a specific range. However, as explained previously, indecision between nearby classes during the ordinal classification may induce an improper prediction of the range. De facto, the wrong regressor can be activated, causing a drop in correlation. The analysis of the first-stage results, illustrated by the confusion matrix (Fig. 2), indicates that prediction mistakes are close, or even adjacent, to the optimal range to which y belongs. We can therefore expand the regression range to [b_{i−1}, b_{i+2}], when these boundaries exist.

Fig. 2.

Confusion matrix of the first stage of the cascade.
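A possible sketch of this second stage: one random forest regressor per range, trained on the widened interval (random forests follow the model choice made in Sect. 5.2; the hyper-parameters and helper name are our own):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_local_regressors(X, y, boundaries):
    """Train one regressor per range i on the widened interval
    [b_{i-1}, b_{i+2}] (clipped at the ends of the scale), so that a
    first-stage indecision between neighboring ranges still activates
    a regressor that has seen the true value of y."""
    n_ranges = len(boundaries) - 1
    regressors = []
    for i in range(n_ranges):
        lo = boundaries[max(i - 1, 0)]
        hi = boundaries[min(i + 2, n_ranges)]
        mask = (y >= lo) & (y <= hi)
        reg = RandomForestRegressor(n_estimators=300, random_state=0)
        reg.fit(X[mask], y[mask])
        regressors.append(reg)
    return regressors
```

At prediction time, the range index returned by the first stage simply selects the regressor to apply to X.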

Widening the local regression ranges helps solve the indecision issue between nearby boundaries. Moreover, it frees us from the obligation to strongly optimize the first stage: using a perfect classifier instead of one reaching an accuracy of 90% in the first stage would not have a significant impact on the result of the whole cascade.

4 Databases

4.1 AVEC’2014

The AVEC’2014 database is a set of audio and video recordings of human/machine interactions [23]. It is composed of 150 recordings collected from 84 German subjects aged between 18 and 63, each recording containing the reactions of a single person. To create this dataset, part of the subjects were recorded several times, with a two-week break between recording sessions: 18 subjects were recorded three times, 31 twice and the remaining 34 only once. In these recordings, the subjects had to perform two tasks:

  • NORTHWIND – The participants read out loud an extract of “Die Sonne und der Wind” (The North Wind and the Sun)

  • FREEFORM – The participants answer numerous questions such as: “What is your favorite meal?”; “Tell us a story about your childhood.”

The recordings are then split into three parts – a training set, a validation set and a test set – among which the 150 Northwind–Freeform pairs are equally distributed. The low-level descriptors are described in Table 1.

Table 1. 42 acoustic low-level descriptors (LLD); 1 computed on voiced and unvoiced frames, respectively; 2 computed on voiced, unvoiced and all frames, respectively [17].

4.2 AV+EC’2015/RECOLA

The second dataset used to measure the performances of our system comes from the AV+EC’2015 affect recognition challenge [17], which relies on the RECOLA database. RECOLA is composed of 9.5 h of audio, video and physiological recordings (ECG, EDA) from 46 French-speaking subjects of different origins (Italian, German and French) and genders. The AV+EC’2015 challenge relies on a subset of 27 fully labelled recordings. In our case, we only used the audio recordings and only worked on the valence, which is known to be the most difficult dimension to predict.

The training, development and testing partitions contain 9 recordings each and preserve the diversity of origins and genders of the subjects. The audio features used are described in the AV+EC’2015 presentation paper [17].

4.3 Data Augmentation

Studying the valence on a bounded interval makes it possible to identify several intensity levels of the felt emotion: depending on its value, the emotion can be qualified as very negative, negative, neutral, positive or very positive. However, in the AVEC’2014 and AV+EC’2015/RECOLA databases, these intensity levels are not equally represented. Figure 3 shows a clear class imbalance favoring observations corresponding to a neutral valence (between −0.1 and 0.1). Since some systems poorly handle strong class imbalance [19], we augmented the data using the Synthetic Minority Over-sampling Technique (SMOTE) [4].
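A hedged sketch of this rebalancing step, using the SMOTE implementation of the imbalanced-learn package (the paper does not name its implementation); the class labels are the range indices of Sect. 3.1:

```python
from imblearn.over_sampling import SMOTE

def rebalance(X, range_labels, random_state=0):
    """Oversample the minority valence ranges with SMOTE [4], so that
    the first-stage classifiers see roughly balanced classes."""
    smote = SMOTE(random_state=random_state)
    X_res, labels_res = smote.fit_resample(X, range_labels)
    return X_res, labels_res
```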

Fig. 3.

Valence distribution on the training set, with [−1, +1] divided into 10 ranges.

5 Experimental Results

5.1 Performance Metrics

The cascade performances are directly linked to those of its two stages. The performance of the ensemble of ordinal classifiers is measured by the accuracy, i.e. the ratio of examples for which the range has been correctly predicted. We use the confusion matrix to analyze the behavior of this system more precisely.

The performances of the ensemble of local regressors are measured using Pearson’s correlation (PC), the gold-standard metric of the AVEC’2014 challenge [23] on which we base our study. However, as these data are not normally distributed, we also measure the performances of our system with Spearman’s correlation (SC) and the concordance correlation coefficient (CCC).
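For reference, the three metrics can be computed as follows (a sketch: PC and SC via SciPy, CCC from its standard definition):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def ccc(y_true, y_pred):
    """Concordance correlation coefficient:
    CCC = 2*cov / (var(y) + var(y_hat) + (mean(y) - mean(y_hat))^2)."""
    mu_t, mu_p = np.mean(y_true), np.mean(y_pred)
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2.0 * cov / (np.var(y_true) + np.var(y_pred) + (mu_t - mu_p) ** 2)

def report(y_true, y_pred):
    return {"PC": pearsonr(y_true, y_pred)[0],
            "SC": spearmanr(y_true, y_pred)[0],
            "CCC": ccc(y_true, y_pred)}
```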

The experimental results presented in the following are computed on the development/validation sets of the different databases. Due to the temporal nature of audio data, we also analyze the outputs of both stages on complete sequences and apply temporal smoothing to refine the results.

5.2 Preliminary Results

As previously stated, our architecture is modular and adapts to any kind of classification or regression method. Throughout our experiments, we used support vector machines (C-SVM with RBF kernels) and random forests (RF with 300 decision trees, attribute bagging on \( \sqrt{n_{features}} \)) as classifiers. Table 2 presents the ordinal classification rates obtained by these two systems on the development sets of AVEC’2014 and AV+EC’2015 for the prediction of valence. We chose this affect variable because it is known to be particularly hard to predict (see the baseline results in row 1). By taking the centers of the predicted ranges as values of ŷ, we computed the correlations of these two systems. These correlations enable us to compare the performances of our classifier ensemble with those of a unique “global” random forest regressor dealing with the whole interval [−1, +1].
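For reproducibility, a sketch of the two first-stage candidates with scikit-learn (the paper does not name its implementation; unspecified hyper-parameters are left at their defaults):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# C-SVM with an RBF kernel (C left at its default value).
svm_clf = SVC(kernel="rbf")

# Random forest: 300 trees, attribute bagging on sqrt(n_features).
rf_clf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                                n_jobs=-1, random_state=0)
```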

Table 2. Valence prediction: comparison of different ordinal classifiers (SVM-OC and RF-OC), one global random forest regressor (RF-GR) and the challenge baseline [11, 22] on the development set. The performance measure is the Pearson correlation coefficient.

The results obtained on both databases encouraged us to continue with random forests rather than support vector machines: the former returned significantly better results than the SVMs, independently of the choice of the sub-intervals. For the same reason, we also used random forests to perform the local regressions.

5.3 Results on AVEC’2014

Table 3 compares the performances of the different systems on the AVEC’2014 development set for several numbers of ranges n.

Table 3. Valence prediction: impact of the number of ranges on performances of global regressor (GR), ordinal classifier (OC), local regressors (LR) and cascade (COCLR). LR performances are computed considering the classification as “perfect” (Accuracy = 1).

First, the interval I was split into 10 ranges with boundaries [−1.0, −0.4, −0.3, −0.2, −0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 1.0]. The most performant system in terms of correlation is, without a doubt, the ordinal classifier ensemble, whose predicted values are the centers of the predicted ranges. It is also worth pointing out that, despite the very high correlation of the local regressors alone, the COCLR system does not seem efficient here.

Then, the interval I was split into 6 ranges with boundaries [−1.0, −0.4, −0.2, 0.0, 0.2, 0.4, 1.0].

On this configuration, the most performant system in terms of correlation is still the ordinal classifier ensemble. However, the performance gap between the COCLR and the ordinal classifier ensemble has narrowed. It is also noteworthy that the accuracy of the classification stage has risen while the correlation of the local regressors alone has slightly dropped.

Finally, the interval I was split into n = 4 ranges with boundaries [−1.0, −0.3, 0.0, 0.3, 1.0]. The previous conclusions on ordinal classifiers and local regressors remain valid, but this time the COCLR cascade turned out to be significantly the most efficient: its correlation is the highest obtained for any choice of ranges. These results highlight the importance of the number of ranges on which the COCLR system relies. It also appears that the correlation of the local regressors alone decreases as the size of the ranges increases, contrary to the accuracy of the classification stage.

5.4 Results on AV+EC’2015/ RECOLA

As previously, we measured the performances of our system with different sets of sub-intervals. On this database the affect value varies within [−0.3, 1.0], so we discarded the classifiers and regressors trained on ]−1.0, −0.3[. Throughout our tests, we used 3 different sets of sub-intervals: the largest, composed of 8 ranges, has boundaries [−0.3, −0.2, −0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 1.0]; the second, composed of 5 ranges, [−0.3, −0.1, 0.0, 0.1, 0.3, 1.0]; and the last one, composed of 3 ranges, [−0.3, 0.0, 0.3, 1.0]. Table 4 summarizes these results.

Table 4. Valence prediction: best obtained models for each number of ranges on AV+EC’2015 development set. Challenge results [12] on the audio channel (AC) and their multimodal system (MM). The performance measure is the Pearson correlation coefficient.

The results obtained on the RECOLA database are similar to those on AVEC’2014: the most performant system remains the COCLR when a small number of ranges is chosen. The correlation obtained by the cascade of ordinal classifiers and local regressors for the valence on the development set reaches 0.675. As previously, we observe a decline in the correlation of the local regressors and a rise in the accuracy of the first stage of the cascade as the size of the sub-intervals increases. Comparisons with the challenge winners’ results [12] are encouraging: though our cascade gets lower results (0.675) than their multimodal system (0.725), it outperforms their audio-only channel (0.529), whose results are similar to those of our first-stage ordinal classifier (0.521).

Last but not least, our proposal is fast to train (<10 min for 3 ranges) and to evaluate (<0.1 ms) on an Intel Core i7 (8 cores, 3.4 GHz), and does not require a large amount of memory (<1 GB for 3 ranges).

5.5 Temporal Smoothing

As previously stated, AVEC’2014 and AV+EC’2015 are based on audio recordings, so the observations provided are temporally linked. Because the systems we trained do not take this characteristic into account, we analyzed our results and the ground truth over time. Figure 4 presents the ground-truth valence and the valence predicted by the ordinal classification system over time, on a sequence of the development set. The system’s errors appear to be mostly punctual: it rarely fails over a wide time window. By applying a temporal smoothing with a sliding window of size 5, we were able to increase the performances of our system, as shown in Fig. 5.
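A sketch of this smoothing step: since the first-stage output is a sequence of discrete range indices, a median filter with a window of size 5 removes isolated errors while preserving level changes (the paper does not specify the exact smoothing operator, so the median is our assumption):

```python
import numpy as np
from scipy.ndimage import median_filter

def smooth_ranges(range_indices, window=5):
    """Sliding-window temporal smoothing of per-frame range predictions."""
    return median_filter(np.asarray(range_indices), size=window,
                         mode="nearest")

print(smooth_ranges([2, 2, 5, 2, 2, 3, 3, 3]))  # the isolated 5 is removed
```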

Fig. 4.

Comparison between the ground truth and the ordinal classification on a sequence of the AVEC’2014 database.

Fig. 5.

Comparison of the results obtained before (right) and after (left) temporal smoothing on the output of the RF ordinal classifiers (resp. 25, 50, 150 and 300 trees).

6 Conclusions

We have proposed in this article an original approach for the regression of a continuous, bounded variable, based on a cascade of ordinal classifiers and local regressors, and applied it to the estimation of affective variables such as the valence. The first stage predicts a trend, depending on the chosen ranges: with four ranges, for example, the emotional state of a person is qualified as very negative, negative, positive or very positive. We observed that this trend is more accurately estimated as the number of ranges increases. The second stage enables a sharper prediction of the variable by regressing it locally, on its range and its direct neighbors; it is all the more efficient when the number of ranges is low, as this reduces the influence of the first stage on the prediction. Finally, we showed that the performances of this cascade compare favorably with those of the winners of the AV+EC’2015 challenge.

Despite these satisfying results, there is still room for improvement (beyond applying the method to the prediction of arousal and the ongoing assessment of its performances on the challenges’ test data). The COCLR is a cascade whose first stage is an ensemble of classifiers, in which the decision is currently limited by the least performant classifier. A better-suited combination rule would advantageously impact the global performances. The outputs (binary or probabilistic) of the ordinal classifiers might also enrich the descriptors used by the local regressors.

To conclude, this research introduces a cascading architecture that obtains promising results on challenging datasets. Several hypotheses have been formulated concerning the impact of the different parameters involved, but none of them has been generalized yet. Testing this architecture on other datasets would help us validate these hypotheses and confirm the general interest of this proposal.