
1 Introduction

Vocal emotion recognition now has applications in domains as diverse as medicine, telecommunications and transport [1]. In telecommunications, for example, it would become possible to prioritize calls from individuals in imminent danger over less urgent ones. More generally, emotion recognition improves human/machine interfaces, which explains the rapid growth of research in this field, driven by recent progress in machine learning.

Human interactions rely on multiple sources: body language, facial expressions, etc. A vocal message carries a lot of information that we interpret implicitly. This information can be expressed or perceived verbally, but also non-verbally, through the tone, volume or speed of the voice. The automatic analysis of such information gives insights into the speaker's emotional state.

The conceptualization of emotions is still a hot topic in psychology, and opinions do not converge towards a single model. Three main approaches can be distinguished [9]: (1) the basic emotions (Anger, Disgust, Fear, Happiness, Sadness, Surprise) described by Ekman [6], (2) the circumplex model of affect and (3) the appraisal theory. In the second model, the affective state is described by at least two dimensions: the valence, which measures the positivity of the emotion, and the arousal, which measures its activity [18, 23]. These two values, bounded to [−1, +1], describe the emotional state of an individual much more precisely than the basic emotions. However, it has been shown that other dimensions are necessary to describe this state more accurately during an interaction [8].

The choice of one model or the other constrains the kind of machine learning algorithms used to estimate the emotional state. With basic emotions, the variable to be predicted is qualitative and nominal, so classification methods must be used. Affective dimensions, on the contrary, are quantitative, continuous and bounded variables, which call for regression methods. To take advantage of the best of both worlds, we propose in this study a method that combines classification and regression. To predict a continuous and bounded variable, we first quantize the affect variable into bounded ranges. For example, a five-range valence quantization would give the boundaries {−1, −0.5, −0.2, +0.2, +0.5, +1}, which can be interpreted as “very negative”, “negative”, “neutral”, “positive” and “very positive”. We then proceed in three steps (a short code sketch of the quantization follows the list):

  • Train an ensemble of classifiers to estimate whether the affect variable associated with an observation is higher than a given boundary;

  • Combine the ensemble decisions to predict the optimal range;

  • Locally regress the affect variable on this range.
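As a minimal illustration of the quantization step above, the following sketch maps a continuous valence value to its range index with the five-range boundaries of the example (the helper name and the use of NumPy are our own):

```python
import numpy as np

# Boundaries of the five-range valence quantization described above.
boundaries = np.array([-1.0, -0.5, -0.2, 0.2, 0.5, 1.0])

def quantize(y, boundaries):
    """Map a continuous affect value y in [-1, +1] to its range index."""
    # searchsorted returns the index of the first boundary strictly
    # greater than y; subtracting 1 yields the range [b_i, b_{i+1}[.
    idx = np.searchsorted(boundaries, y, side="right") - 1
    # Clip so that y = +1 falls in the last range instead of beyond it.
    return int(np.clip(idx, 0, len(boundaries) - 2))

print(quantize(0.15, boundaries))  # -> 2, i.e. the "neutral" range
```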

The proposed method is therefore a cascade of ordinal classifiers and local regressors (COCLR). As the state of the art below shows, similar proposals have been made. In this paper, however, we perform a thorough study of the key parameter of this method: the number of ranges to be separated by the ensemble of ordinal classifiers. We show experimentally that:

  • On small and numerous ranges, ordinal classification performs well;

  • On large ranges, the COCLR cascade performs better;

  • On challenging databases (AVEC’2014 [23] and AV+EC’2015 [17], described in Sect. 4), the COCLR cascade compares favorably with the challengers’ and winners’ proposals, at an acceptable development and computational cost.

This paper is organized as follows. Section 2 reviews the state of the art on affect prediction from audio data. In Sect. 3, we present the COCLR flowchart. In Sect. 4, we introduce the datasets used to train and evaluate our system and the pre-processing applied. Then, in Sect. 5, we present and discuss our results. Finally, Sect. 6 offers some conclusions.

2 State of the Art

The Audio-Visual Emotion recognition Challenge (AVEC), held every year since 2011, makes it possible to assess the proposed systems on similar datasets. The main objective of these challenges is to ensure a fair comparison between research teams by using the same data. In particular, the unlabeled test set is released to registered participants a few days before the challenge deadline. Moreover, the organizers provide competitors with a set of audio and video descriptors extracted by approved methods.

Prosodic features such as pitch, intensity, speech rate and voice quality are important for identifying the different types of emotions. Low-level acoustic descriptors such as energy, spectrum, cepstral coefficients, formants, etc. enable an accurate description of the signal [23]. Furthermore, it has recently been shown that the features learned by the first layers of deep convolutional networks are quite similar to some of these acoustic descriptors [22].

2.1 Emotion Classification and Prediction

The classification of emotion is done through classical methods like support vector machines (SVM) [2], Gaussian mixture models (GMM) [20] or random forests (RF) [15]. For regression tasks, numerous models have been proposed: support vector regressors (SVR) [5], deep belief networks (DBN) [13], bidirectional long-short term memory networks (BLSTM) [14], etc. As all these models have their own pros and cons, recent works focus on model combinations to improve overall accuracy. Thus, in [11], the authors propose to combine a BLSTM and an SVR, to benefit from the BLSTM's handling of past/present context and the generalization ability of the SVR.

The AV+EC’2015 challenge winners proposed in [12] a hierarchy of BLSTMs. They deal with 4 information channels: audio, video (described by frame-by-frame geometric features and temporal appearance features), electrocardiogram and electrodermal activity. They combine the predictions of single-modal deep BLSTMs with a multimodal deep BLSTM that performs the final affect prediction.

2.2 Ordinal Classification and Hierarchical Prediction

The standard approach to ordinal classification converts the class value into a numeric quantity and applies a regression learner to the transformed data, translating the output back into a discrete class value in a post-processing step [7]. Here, we work directly on the numerical values of the affect variables but quantize them into several ranges. Recently, a discrete classification of continuous affective variables through generative adversarial networks (GAN) has been proposed [3]; five ranges are considered.
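For illustration only, a hedged sketch of this standard regression-based scheme (our own simplification, not the exact procedure of [7]): the ordered classes are encoded as integers, any regressor is fitted, and predictions are rounded back to the nearest class.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class RegressionOrdinalClassifier:
    """Ordinal classification via regression: encode the K ordered
    classes as 0..K-1, regress, then round back to the nearest class."""

    def __init__(self, n_classes, regressor=None):
        self.n_classes = n_classes
        self.regressor = regressor or DecisionTreeRegressor(max_depth=5)

    def fit(self, X, y_class):
        self.regressor.fit(X, np.asarray(y_class, dtype=float))
        return self

    def predict(self, X):
        raw = self.regressor.predict(X)
        return np.clip(np.rint(raw), 0, self.n_classes - 1).astype(int)
```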

The idea of combining regressors and classifiers has already been applied to age estimation from images. In [10], a first “global” regression is performed with an SVR on all ages; it is then refined by locally adjusting the regressed age value with an SVM. In [21], the authors propose another hierarchy for the same problem. They define 3 age ranges (namely “child”, “teen” and “adult”). An image is classified by combining the results of a pool of classifiers (SVC, FLD, PLS, NN and naïve Bayes) with a majority vote. A second stage then uses the appropriate relevance vector machine regression model (trained on one age range) to estimate the age.

The idea of such a hierarchy is not new, but its application to affect data has not been proposed yet. Moreover, we show in the following experiments that the number of boundaries considered impacts the performance of the whole hierarchy.

3 Cascade of Ordinal Classifiers and Local Regressors

The cascade of ordinal classifiers and local regressors proposed here is a hybrid combination of classification and regression systems. Let us note X the observation (feature vector), y the affective variable to be predicted (valence or arousal) and ŷ the prediction. The variable y is continuous and defined on the bounded interval [−1, +1]. It is therefore possible to segment this interval into a set of n smaller sub-intervals, called “ranges” in the following, delimited by the boundaries b_i, i ∈ {1, …, n + 1}. For example, n = 2 defines 2 ranges, [−1, 0[ (“negative”) and [0, +1] (“positive”), and 3 boundaries b_i ∈ {−1, 0, +1}. Each boundary b_i (except −1 and +1) defines a binary classification problem: given the observation X, decide whether y is lower or higher than b_i. By combining the outputs of the (n − 1) binary classifiers, we get an ordinal classification: given the observation X, the prediction ŷ is probably (necessarily, in case of perfect classification) located within the range [b_i, b_{i+1}[. Once this range is obtained, a local regression is run on it, along with its direct neighbors, to predict y. Figure 1 illustrates the full cascade. The structure of this system is modular and compatible with any kind of classification and regression algorithm. Moreover, it is generic and may be adapted to subjects other than affective dimension prediction.

Fig. 1.

COCLR: a two-stage cascade. The first stage is a combination of binary classifiers whose aim is to estimate y’s range. The observation X is then handled by the corresponding local regressor, which evaluates the value of y on this range and its neighbors.

3.1 Ordinal Classification

The regression of an affect value y from an observation X can be bounded by the minimum and maximum values this variable might take. The interval on which y is defined, I = [min(y), max(y)], can be divided into n ranges.

The first stage of the cascade is an ensemble of (n − 1) binary classifiers. Each classifier decides whether, given the observation X, the variable to be predicted is higher than the lower boundary b_i of a range or not. Training samples are labeled −1 if their y value is lower than b_i and +1 otherwise. Given the ordered nature of the boundaries b_i, we thus build an ensemble of ordinal classifiers [7].

We combine the decisions of these classifiers to compute the lower and upper bounds of the optimal range [b_i, b_{i+1}[. Consider an observation X with y = 0.15, a number of ranges n = 6 and the boundaries b_i ∈ {−1.0, −0.5, −0.25, 0, 0.25, 0.5, 1.0}. In case of perfect classification, the output vector of the ensemble of (n − 1) = 5 classifiers (one per internal boundary) would be {+1, +1, +1, −1, −1}, where −1 means “y is lower than b_i” and +1 means “y is higher than b_i”. The bound b_i is therefore the one associated with the last classifier with a positive output, and b_{i+1} the one associated with the first classifier with a negative output. By combining the local decisions of these binary classifiers, we get the (optimal) range [b_i, b_{i+1}[, here [0, 0.25[. This range C_i will be used in the second stage to locally predict y. However, indecision between two classifiers can happen [16]; it will be handled by the second stage of the cascade.
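A minimal sketch of this combination rule, assuming each of the (n − 1) classifiers outputs ±1 as above (counting positive votes is one simple rule, and it also yields a plausible range when the vote vector is inconsistent; the function name is ours):

```python
import numpy as np

# Boundaries of the worked example: n = 6 ranges, 7 boundaries.
boundaries = np.array([-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0])

def decode_range(votes, boundaries):
    """Combine the (n - 1) binary decisions into a range [b_i, b_{i+1}[.

    votes[k] = +1 means "y is higher than boundaries[k + 1]" (the k-th
    internal boundary), -1 otherwise."""
    i = int(np.sum(np.asarray(votes) > 0))
    return boundaries[i], boundaries[i + 1]

lo, hi = decode_range([+1, +1, +1, -1, -1], boundaries)  # perfect votes, y = 0.15
print(lo, hi)           # -> 0.0 0.25
print(0.5 * (lo + hi))  # center of the range: the discrete estimate used below
```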

The performance measure of the ordinal classifiers, the accuracy, is directly linked to the definition of the ranges. The number of ranges n is a key parameter of our system and can be seen as a hyper-parameter. The n ranges and their corresponding boundaries b_i can be defined in several ways: if they are linearly distributed, they define a kind of affective scale, as in [3], but the boundaries b_i can also be chosen to prevent strong imbalances between classes. In case of highly imbalanced classes, the application of a data augmentation method is strongly recommended [4].
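As an illustration, two boundary-placement strategies (the quantile-based variant is our own example of imbalance-aware placement, not the paper's exact procedure):

```python
import numpy as np

def linear_boundaries(n_ranges, lo=-1.0, hi=1.0):
    """Linearly spaced boundaries: a kind of affective scale, as in [3]."""
    return np.linspace(lo, hi, n_ranges + 1)

def quantile_boundaries(y_train, n_ranges):
    """Quantile-based boundaries: roughly equal class sizes, which
    limits the imbalance discussed above."""
    return np.quantile(y_train, np.linspace(0.0, 1.0, n_ranges + 1))

# Example on synthetic, neutral-heavy valence values:
y_train = np.clip(np.random.default_rng(0).normal(0.0, 0.2, 5000), -1, 1)
print(linear_boundaries(4))
print(quantile_boundaries(y_train, 4))
```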

Given a set of ranges, we can evaluate the accuracy (range detection rate) of the classifier combination. The combination can also be used to compute a discrete estimate of y, by taking the center of the predicted range as the value of ŷ. Finally, we can estimate the correlation of ŷ with the ground truth y.

3.2 Local Regression

The aim of the second stage of the cascade is to compute the continuous value of y. Each range i is therefore associated with a regressor R_i that locally regresses y on [b_i, b_{i+1}]; each regressor is thus specialized in the regression on a specific range. However, as explained previously, indecision between nearby classes during the ordinal classification may induce an improper prediction of the range. De facto, the wrong regressor can be activated, causing a drop in correlation. The analysis of the first-stage results, illustrated by the confusion matrix (Fig. 2), indicates that prediction mistakes are close, or even adjacent, to the optimal range to which y belongs. We can therefore expand the regression range to [b_{i−1}, b_{i+2}], when these boundaries exist.

Fig. 2.

Confusion matrix of the first stage of the cascade.
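A possible sketch of this second stage: one random forest regressor per range, trained on the widened interval (random forests follow the model choice made in Sect. 5.2; the hyper-parameters and helper name are our own):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_local_regressors(X, y, boundaries):
    """Train one regressor per range i on the widened interval
    [b_{i-1}, b_{i+2}] (clipped at the ends of the scale), so that a
    first-stage indecision between neighboring ranges still activates
    a regressor that has seen the true value of y."""
    n_ranges = len(boundaries) - 1
    regressors = []
    for i in range(n_ranges):
        lo = boundaries[max(i - 1, 0)]
        hi = boundaries[min(i + 2, n_ranges)]
        mask = (y >= lo) & (y <= hi)
        reg = RandomForestRegressor(n_estimators=300, random_state=0)
        reg.fit(X[mask], y[mask])
        regressors.append(reg)
    return regressors
```

At prediction time, the range index returned by the first stage simply selects the regressor to apply to X.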

Widening the local regression ranges helps solve the indecision issue between nearby boundaries. Moreover, it frees us from the obligation to strongly optimize the first stage: using a perfect classifier instead of one reaching an accuracy of 90% in the first stage would not have a significant impact on the result of the whole cascade.

4 Databases

4.1 AVEC’2014

The AVEC’2014 database is a set of audio and video recordings of human/machine interactions [23]. It is composed of 150 recordings collected from 84 German subjects aged between 18 and 63, each recording containing the reactions of a single person. To create this dataset, part of the subjects were recorded several times, with a two-week break between recording sessions: 18 subjects were recorded three times, 31 twice and the remaining 34 only once. In these recordings, the subjects had to perform two tasks:

  • NORTHWIND – The participants read out loud an extract of “Die Sonne und der Wind” (The North Wind and the Sun)

  • FREEFORM – The participants answer numerous questions such as: “What is your favorite meal?”; “Tell us a story about your childhood.”

The recordings are then split into three parts – a training set, a validation set and a test set – among which the 150 Northwind–Freeform pairs are equally distributed. The low-level descriptors are described in Table 1.

Table 1. 42 acoustic low-level descriptors (LLD); 1 computed on voiced and unvoiced frames, respectively; 2 computed on voiced, unvoiced and all frames, respectively [17].

4.2 AV+EC’2015/RECOLA

The second dataset used to measure the performances of our system comes from the AV+EC’2015 affect recognition challenge [17], which relies on the RECOLA database. RECOLA is composed of 9.5 h of audio, video and physiological recordings (ECG, EDA) from 46 French-speaking subjects of different origins (Italian, German and French) and genders. The AV+EC’2015 challenge relies on a subset of 27 fully labelled recordings. In our case, we only used the audio recordings and only worked on the valence, which is known to be the most difficult dimension to predict.

The training, development and testing partitions contain 9 recordings each and preserve the diversity of origins and genders of the subjects. The audio features used are described in the AV+EC’2015 presentation paper [17].

4.3 Data Augmentation

Studying the valence on a bounded interval makes it possible to identify several intensity levels of the felt emotion: depending on its value, the emotion can be qualified as very negative, negative, neutral, positive or very positive. However, in the AVEC’2014 and AV+EC’2015/RECOLA databases, these intensity levels are not equally represented. Figure 3 shows a clear class imbalance favoring observations corresponding to a neutral valence (between −0.1 and 0.1). Since some systems poorly handle strong class imbalance [19], we augmented the data using the Synthetic Minority Over-sampling Technique (SMOTE) [4].
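A hedged sketch of this rebalancing step, using the SMOTE implementation of the imbalanced-learn package (the paper does not name its implementation); the class labels are the range indices of Sect. 3.1:

```python
from imblearn.over_sampling import SMOTE

def rebalance(X, range_labels, random_state=0):
    """Oversample the minority valence ranges with SMOTE [4], so that
    the first-stage classifiers see roughly balanced classes."""
    smote = SMOTE(random_state=random_state)
    X_res, labels_res = smote.fit_resample(X, range_labels)
    return X_res, labels_res
```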

Fig. 3.

Valence distribution on the training set, with [−1, +1] divided into 10 ranges.

5 Experimental Results

5.1 Performance Metrics

The cascade performances are directly linked to those of its two stages. The performance of the ensemble of ordinal classifiers is measured by the accuracy, i.e. the ratio of examples for which the range has been correctly predicted. We use the confusion matrix to analyze the behavior of this system more precisely.

The performances of the ensemble of local regressors are measured using Pearson’s correlation (PC), the gold-standard metric of the AVEC’2014 challenge [23] on which we base our study. However, as these data are not normally distributed, we also measure the performances of our system with Spearman’s correlation (SC) and the concordance correlation coefficient (CCC).
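For reference, the three metrics can be computed as follows (a sketch: PC and SC via SciPy, CCC from its standard definition):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def ccc(y_true, y_pred):
    """Concordance correlation coefficient:
    CCC = 2*cov / (var(y) + var(y_hat) + (mean(y) - mean(y_hat))^2)."""
    mu_t, mu_p = np.mean(y_true), np.mean(y_pred)
    cov = np.mean((y_true - mu_t) * (y_pred - mu_p))
    return 2.0 * cov / (np.var(y_true) + np.var(y_pred) + (mu_t - mu_p) ** 2)

def report(y_true, y_pred):
    return {"PC": pearsonr(y_true, y_pred)[0],
            "SC": spearmanr(y_true, y_pred)[0],
            "CCC": ccc(y_true, y_pred)}
```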

The experimental results presented in the following are computed on the development/validation sets of the different databases. Due to the temporal nature of audio data, we also analyze the outputs of both stages on complete sequences and apply temporal smoothing to refine the results.

5.2 Preliminary Results

As previously stated, our architecture is modular and adapts to any kind of classification or regression method. Throughout our experiments, we used support vector machines (C-SVM with RBF kernels) and random forests (RF with 300 decision trees, attribute bagging on \( \sqrt{n_{features}} \)) as classifiers. Table 2 presents the ordinal classification rates obtained by these two systems on the development sets of AVEC’2014 and AV+EC’2015 for the prediction of valence. We chose this affect variable because it is known to be particularly hard to predict (see the baseline results in row 1). By taking the centers of the predicted ranges as values of ŷ, we computed the correlations of these two systems. These correlations enable us to compare the performances of our classifier ensemble with those of a unique “global” random forest regressor dealing with the whole interval [−1, +1].
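For reproducibility, a sketch of the two first-stage candidates with scikit-learn (the paper does not name its implementation; unspecified hyper-parameters are left at their defaults):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# C-SVM with an RBF kernel (C left at its default value).
svm_clf = SVC(kernel="rbf")

# Random forest: 300 trees, attribute bagging on sqrt(n_features).
rf_clf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                                n_jobs=-1, random_state=0)
```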

Table 2. Valence prediction: comparison of different ordinal classifiers (SVM-OC and RF-OC), one global random forest regressor (RF-GR) and the challenge baseline [11, 22] on the development set. The performance measure is the Pearson correlation coefficient.

The results obtained on both databases encouraged us to continue with random forests rather than support vector machines: the former returned significantly better results than the SVMs, independently of the choice of the sub-intervals. For the same reason, we also used random forests to perform the local regressions.

5.3 Results on AVEC’2014

Table 3 compares the performances of the different systems on the AVEC’2014 development set for several numbers of ranges n.

Table 3. Valence prediction: impact of the number of ranges on performances of global regressor (GR), ordinal classifier (OC), local regressors (LR) and cascade (COCLR). LR performances are computed considering the classification as “perfect” (Accuracy = 1).

First, the interval I was split into 10 ranges with boundaries [−1.0, −0.4, −0.3, −0.2, −0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 1.0]. The most performant system in terms of correlation is, without a doubt, the ordinal classifier ensemble, whose predicted values are the centers of the predicted ranges. It is also worth pointing out that, despite the very high correlation of the local regressors alone, the COCLR system does not seem efficient here.

Then, the interval I was split into 6 ranges with boundaries [−1.0, −0.4, −0.2, 0.0, 0.2, 0.4, 1.0].

On this configuration, the most performant system in terms of correlation is still the ordinal classifier ensemble. However, the performance gap between the COCLR and the ordinal classifier ensemble has narrowed. It is also noteworthy that the accuracy of the classification stage has risen while the correlation of the local regressors alone has slightly dropped.

Finally, the interval I was split into n = 4 ranges with boundaries [−1.0, −0.3, 0.0, 0.3, 1.0]. The previous conclusions on ordinal classifiers and local regressors remain valid, but this time the COCLR cascade turned out to be significantly the most efficient: its correlation is the highest obtained for any choice of ranges. These results highlight the importance of the number of ranges on which the COCLR system relies. It also appears that the correlation of the local regressors alone decreases as the size of the ranges increases, contrary to the accuracy of the classification stage.

5.4 Results on AV+EC’2015/ RECOLA

As previously, we measured the performances of our system with different sets of sub-intervals. On this database the affect value varies within [−0.3, 1.0], so we discarded the classifiers and regressors trained on ]−1.0, −0.3[. Throughout our tests, we used 3 different sets of sub-intervals: the largest, composed of 8 ranges, has boundaries [−0.3, −0.2, −0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 1.0]; the second, composed of 5 ranges, [−0.3, −0.1, 0.0, 0.1, 0.3, 1.0]; and the last one, composed of 3 ranges, [−0.3, 0.0, 0.3, 1.0]. Table 4 summarizes these results.

Table 4. Valence prediction: best obtained models for each number of ranges on AV+EC’2015 development set. Challenge results [12] on the audio channel (AC) and their multimodal system (MM). The performance measure is the Pearson correlation coefficient.

The results obtained on the RECOLA database are similar to those on AVEC’2014: the most performant system remains the COCLR when a small number of ranges is chosen. The correlation obtained by the cascade of ordinal classifiers and local regressors for the valence on the development set reaches 0.675. As previously, we observe a decline in the correlation of the local regressors and a rise in the accuracy of the first stage of the cascade as the size of the sub-intervals increases. Comparisons with the challenge winners’ results [12] are encouraging: though our cascade gets lower results (0.675) than their multimodal system (0.725), it outperforms their audio-only channel (0.529), whose results are similar to those of our first-stage ordinal classifier (0.521).

Last but not least, our proposal is fast to train (<10 min for 3 ranges) and to evaluate (<0.1 ms) on an Intel Core i7 (8 cores, 3.4 GHz), and does not require a large amount of memory (<1 GB for 3 ranges).

5.5 Temporal Smoothing

As previously stated, AVEC’2014 and AV+EC’2015 are based on audio recordings, so the observations provided are temporally linked. Because the systems we trained do not take this characteristic into account, we analyzed our results and the ground truth over time. Figure 4 presents the ground-truth valence and the valence predicted by the ordinal classification system over time, on a sequence of the development set. The system’s errors appear to be mostly punctual: it rarely fails over a wide time window. By applying a temporal smoothing with a sliding window of size 5, we were able to increase the performances of our system, as shown in Fig. 5.
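A sketch of this smoothing step: since the first-stage output is a sequence of discrete range indices, a median filter with a window of size 5 removes isolated errors while preserving level changes (the paper does not specify the exact smoothing operator, so the median is our assumption):

```python
import numpy as np
from scipy.ndimage import median_filter

def smooth_ranges(range_indices, window=5):
    """Sliding-window temporal smoothing of per-frame range predictions."""
    return median_filter(np.asarray(range_indices), size=window,
                         mode="nearest")

print(smooth_ranges([2, 2, 5, 2, 2, 3, 3, 3]))  # the isolated 5 is removed
```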

Fig. 4.

Comparison between the ground truth and the ordinal classification on a sequence of the AVEC’2014 database.

Fig. 5.

Comparison of the results obtained before (right) and after (left) temporal smoothing on the output of the RF ordinal classifiers (resp. 25, 50, 150 and 300 trees).

6 Conclusions

We have proposed in this article an original approach for the regression of a continuous, bounded variable, based on a cascade of ordinal classifiers and local regressors, and applied it to the estimation of affective variables such as the valence. The first stage predicts a trend, depending on the chosen ranges: with four ranges, for example, the emotional state of a person is qualified as very negative, negative, positive or very positive. We observed that this trend is more accurately estimated as the number of ranges increases. The second stage enables a sharper prediction of the variable by regressing it locally, on its range and its direct neighbors; it is all the more efficient when the number of ranges is low, as this reduces the influence of the first stage on the prediction. Finally, we showed that the performances of this cascade compare favorably with those of the winners of the AV+EC’2015 challenge.

Despite these satisfying results, there is still room for improvement (beyond applying the method to the prediction of arousal and the ongoing assessment of its performances on the challenges’ test data). The COCLR is a cascade whose first stage is an ensemble of classifiers, in which the decision is currently limited by the least performant classifier. A better-suited combination rule would advantageously impact the global performances. The outputs (binary or probabilistic) of the ordinal classifiers might also enrich the descriptors used by the local regressors.

To conclude, this research introduces a cascading architecture that obtains promising results on challenging datasets. Several hypotheses have been formulated concerning the impact of the different parameters involved, but none of them has been generalized yet. Testing this architecture on other datasets would help us validate these hypotheses and confirm the general interest of this proposal.