
1 Introduction

In recent years, cost-sensitive learning has been widely studied and has become one of the most important topics for solving the class imbalance problem [1]. In [2], Zhou et al. empirically studied the effect of sampling and threshold-moving in training cost-sensitive neural networks, and revealed that threshold-moving and soft-ensemble are relatively good choices. In [3], Sun et al. proposed cost-sensitive boosting algorithms, which are developed by introducing cost items into the learning framework of AdaBoost. In [4], Jiang et al. proposed a novel Minority Cloning Technique (MCT) for class-imbalanced cost-sensitive learning. MCT alters the class distribution of the training data by cloning each minority class instance according to the similarity between it and the mode of the minority class. In [5], George proposed a new cost-sensitive metric to find the optimal tradeoff between the two most critical performance measures of a classification task: accuracy and cost. Generally, users focus more on the minority class and consider the cost of misclassifying a minority class instance to be higher. In our study, we adopt the same strategy to address this problem.

Motivated by the probabilistic collaborative representation based approach for pattern classification [6] and Zhang’s work [7], in this paper we propose a new method to handle misclassification costs and the class imbalance problem, called Cost-sensitive Collaborative Representation based Classification via Probability Estimation Addressing the Class Imbalance Problem (CSCRC). In Zhang’s cost-sensitive learning framework, the posterior probabilities of a testing sample are estimated by the KLR or KNN method. In [6], the Probabilistic Collaborative Representation based approach for pattern Classification (ProCRC) is designed to achieve the lowest recognition error and assumes the same loss for every type of misclassification, so it can hardly resolve the class imbalance problem. We therefore introduce the cost-sensitive learning framework into ProCRC, which not only clarifies the relationship between the Gaussian function and collaborative representation but also resolves the cost-sensitivity limitation of [6]. Firstly, we use the probabilistic collaborative representation framework to estimate the posterior probabilities. The posterior probabilities are generated directly from the coding coefficients by using a Gaussian function and applying the logarithmic operator to the probabilistic collaborative representation framework; this clearly explains the l2-norm regularized representation scheme used in the collaborative representation based classifier (CRC). Secondly, all the misclassification losses are calculated using Zhang’s cost-sensitive learning framework. Finally, the test sample is assigned to the class whose loss is minimal. Experimental results on UCI databases validate the effectiveness and efficiency of our method.

2 Proposed Approach

In Cai’s work [6], different data points x have different probabilities of \( l(x)\, \in \,l_{X} \), where l(x) denotes the label of x and lX denotes the label set of all candidate classes in X; \( P\left( {l\left( x \right)\, \in \,l_{X} } \right) \) should be higher if the l2-norm of \( \alpha \) is smaller, and vice versa. One intuitive choice is to use a Gaussian function to define such a probability:

$$ P\left( {l(x)\, \in \,l_{X} } \right)\,{ \propto }\,exp\left( { - c\left\| \alpha \right\|_{2}^{2} } \right) $$
(1)

where c is a constant and data points are assigned different probabilities based on \( \alpha \); all these data points lie inside the subspace spanned by the samples in X. For a sample y outside the subspace, the probability is defined as:

$$ P\left( {l(y) \in l_{X} } \right) = P\left( {l(y) = l(x)\left| {l(x) \in l_{X} } \right.} \right)P\left( {l(x) \in l_{X} } \right) $$
(2)

\( P(l(x) \in l_{X} ) \) has been defined in Eq. (1). \( P(l(y) = l(x)\left| {l(x) \in l_{X} } \right.) \) can be measured by the similarity between x and y. Here we adopt the Gaussian kernel to define it:

$$ P(l(y) = l(x)\left| {l(x) \in l_{X} } \right.)\, \propto \,exp( - k\left\| {y - x} \right\|_{2}^{2} ) $$
(3)

where k is a constant. With Eqs. (1)–(3), we have

$$ P(l(y) \in l_{X} )\, \propto \,exp( - (k\left\| {y - X\alpha } \right\|_{2}^{2} + c\left\| \alpha \right\|_{2}^{2} )) $$
(4)

In order to maximize the probability, we can apply the logarithmic operator to Eq. (4), which gives:

$$ \begin{aligned} \max P(l(y)\, \in \,l_{X} ) & = \max \ln (P(l(y)\, \in \,l_{X} )) \\ & = \min_{\alpha } \, k\left\| {y - X\alpha } \right\|_{2}^{2} + c\left\| \alpha \right\|_{2}^{2} \\ & = \min_{\alpha } \, \left\| {y - X\alpha } \right\|_{2}^{2} + \lambda \left\| \alpha \right\|_{2}^{2} \\ \end{aligned} $$
(5)

where \( \lambda = c/k \). Interestingly, Eq. (5) shares the same formulation as the representation formula of CRC [4], but it has a clear probabilistic interpretation.
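
To make Eq. (5) concrete, the following is a minimal sketch of the l2-regularized coding step using its closed-form ridge solution. The dictionary X, query y and regularization parameter lam are synthetic placeholders for illustration only, not the paper's data or implementation.

```python
# Sketch of Eq. (5): min_alpha ||y - X alpha||^2 + lam ||alpha||^2 (lam = c/k).
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 50                       # feature dimension, number of training samples
X = rng.standard_normal((d, n))     # dictionary of training samples (one per column)
y = rng.standard_normal(d)          # query sample
lam = 0.1                           # illustrative regularization parameter

# Closed-form ridge solution: alpha = (X^T X + lam I)^{-1} X^T y
alpha = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# The exponent of Eq. (4) evaluated at the optimum (up to the constant k)
objective = np.sum((y - X @ alpha) ** 2) + lam * np.sum(alpha ** 2)
print("coding objective:", objective)
```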

A sample x inside the subspace can be collaboratively represented as \( x = X\alpha = \sum\nolimits_{k = 1}^{K} {X_{k} \alpha_{k} } \), where \( \alpha = [\alpha_{1} ;\alpha_{2} ; \ldots ;\alpha_{K} ] \) and \( \alpha_{k} \) is the coding vector associated with Xk. Note that \( x_{k} = X_{k} \alpha_{k} \) is a data point falling into the subspace of class k. Then, we have

$$ P(l(x) = k\left| {l(x) \in l_{X} } \right.)\, \propto \,exp( - \delta \left\| {x - X_{k} \alpha_{k} } \right\|_{2}^{2} ) $$
(6)

where \( \delta \) is a constant. For a query sample y, we can compute the probability that \( l(y) = k \) as:

$$ \begin{aligned} & P(l(y) = k) \\ & = P(l(y) = l(x)\left| {l(x) = k} \right.) \cdot P(l(x) = k) \\ & = P(l(y) = l(x)\left| {l(x) = k} \right.) \cdot P(l(x) = k\left| {l(x) \in l_{X} } \right.) \cdot P(l(x) \in l_{X} ) \\ \end{aligned} $$
(7)

Since the probability definition in Eq. (3) is independent of k as long as \( k \in l_{X} \), we have \( P(l(y) = l(x)\left| {l(x) = k} \right.) = P(l(y) = l(x)\left| {l(x) \in l_{X} } \right.) \). With Eqs. (5)–(7), we have

$$ \begin{aligned} P(l(y) = k) & = P(l(y) \in l_{X} ) \cdot P(l(x) = k\left| {l(x) \in l_{X} } \right.) \\ & \propto \,exp( - (\left\| {y - X\alpha } \right\|_{2}^{2} + \lambda \left\| \alpha \right\|_{2}^{2} + \gamma \left\| {X\alpha - X_{k} \alpha_{k} } \right\|_{2}^{2} )) \\ \end{aligned} $$
(8)

where \( \gamma = \delta /k \). Applying the logarithmic operator to Eq. (8) and ignoring the constant term, we have:

$$ \hat{\alpha } = \arg \min_{\alpha } \{ \left\| {y - X\alpha } \right\|_{2}^{2} + \lambda \left\| \alpha \right\|_{2}^{2} + \left\| {X\alpha - X_{k} \alpha_{k} } \right\|_{2}^{2} \} $$
(9)

Referring to Eq. (9), let \( X^{\prime}_{k} \) be a matrix of the same size as X in which only the samples of \( X_{k} \) are kept at their corresponding locations, i.e., \( X^{\prime}_{k} = \left[ {0, \ldots ,X_{k} , \ldots ,0} \right] \). Let \( \bar{X}^{\prime}_{k} = X - X^{\prime}_{k} \). We can then compute the following projection matrix offline:

$$ T = (X^{T} X + (\bar{X}^{\prime}_{k} )^{T} \bar{X}^{\prime}_{k} + \lambda I)^{ - 1} X^{T} $$
(10)

where I denotes the identity matrix. Then, \( \hat{\alpha } = Ty \).
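
As an illustration of Eqs. (9)–(10), the sketch below builds \( X^{\prime}_{k} \) and \( \bar{X}^{\prime}_{k} \) from per-column class labels and forms the class-specific projection matrix T. The synthetic data and variable names are assumptions for illustration, not the paper's implementation.

```python
# Sketch of Eq. (10): T = (X^T X + (Xbar'_k)^T Xbar'_k + lam I)^{-1} X^T.
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 20, 50, 0.1
X = rng.standard_normal((d, n))          # all training samples as columns
labels = rng.integers(0, 2, size=n)      # class label of each column (0 or 1)
y = rng.standard_normal(d)               # query sample

def projection_matrix(X, labels, k, lam):
    """Class-specific projection matrix T for class k."""
    Xk_prime = X * (labels == k)         # X'_k: class-k columns kept, others zeroed
    Xbar_k = X - Xk_prime                # complement of X'_k
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + Xbar_k.T @ Xbar_k + lam * np.eye(m), X.T)

# T can be precomputed offline for each class; the coding vector is then T @ y.
alpha_hat = {k: projection_matrix(X, labels, k, lam) @ y for k in (0, 1)}
```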

With the model in Eq. (9), a solution vector \( \hat{\alpha } \) is obtained. The probability P(l(y) = k) can be computed by:

$$ P(l(y) = k) \propto \,exp( - (\left\| {y - X\hat{\alpha }} \right\|_{2}^{2} + \lambda \left\| {\hat{\alpha }} \right\|_{2}^{2} + \left\| {X\hat{\alpha } - X_{k} \hat{\alpha }_{k} } \right\|_{2}^{2} )) $$
(11)

Note that \( \left( {\left\| {y - X\hat{\alpha }} \right\|_{2}^{2} + \lambda \left\| {\hat{\alpha }} \right\|_{2}^{2} } \right) \) is the same for all classes, and thus we can omit it in computing P(l(y) = k). Then we have

$$ P_{k} = exp\left( { - \left( {\left\| {X\hat{\alpha } - X_{k} \hat{\alpha }_{k} } \right\|_{2}^{2} } \right)} \right) $$
(12)
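
A minimal sketch of turning the class-specific coding vectors into the probabilities of Eqs. (11)–(12) is given below; the normalization over classes is our own addition (the paper states only the proportionality), and the per-class dictionary of coding vectors is an assumed interface from the previous sketch.

```python
# Sketch of Eq. (12): P_k proportional to exp(-||X alpha - X_k alpha_k||^2).
import numpy as np

def class_probabilities(X, labels, alpha_hat):
    """alpha_hat maps each class k to its coding vector (from its own T, Eq. (10))."""
    scores = {}
    for k, alpha in alpha_hat.items():
        mask = (labels == k)
        diff = X @ alpha - X[:, mask] @ alpha[mask]   # X alpha - X_k alpha_k
        scores[k] = np.exp(-np.sum(diff ** 2))
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}  # normalized for use in Eq. (13)
```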

In cost-sensitive learning, the loss function is regarded as the objective function for determining the label of a test sample. In a binary classification problem there are two misclassification costs: we denote the cost of misclassifying a positive (minority) sample as negative by C10, and the cost of misclassifying a negative (majority) sample as positive by C01. A cost matrix can then be constructed as shown in Table 1, where G1 and G0 represent the labels of the minority class and the majority class, respectively.

Table 1 The classification accuracy for the 5 methods on 10 data sets

It is well known that the loss function can be related to the posterior probability \( P (\phi (y )\left| y \right. )\, \approx \,P (l (y )= k ) \). The loss function can then be rewritten as follows:

$$ loss (y,\phi (y)) = \left\{ {\begin{array}{*{20}l} {\sum\limits_{i = G_{1} } {P_{i} C_{10} } } & {{\text{if }}\phi (y) = G_{0} } \\ {\sum\limits_{i = G_{0} } {P_{i} C_{01} } } & {{\text{if }}\phi (y) = G_{1} } \\ \end{array} } \right. $$
(13)

The test sample y is assigned to the class with the smaller expected loss. We obtain the label of the test sample y by minimizing Eq. (13):

$$ L(y) = \arg \min_{i \in \{ 0,1\} } \, loss (y,\phi (y)) $$
(14)
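
In the two-class case, the decision rule of Eqs. (13)–(14) reduces to comparing two expected losses. The sketch below assumes normalized class probabilities P[1] (minority, G1) and P[0] (majority, G0) from Eq. (12) and illustrative cost values; it is not the paper's implementation.

```python
# Sketch of Eqs. (13)-(14) for binary classification.
C10 = 10.0   # cost of predicting G0 when the true class is the minority class G1
C01 = 1.0    # cost of predicting G1 when the true class is the majority class G0

def predict_label(P):
    """Return 1 (minority, G1) or 0 (majority, G0), whichever has the smaller loss."""
    loss_predict_0 = P[1] * C10          # loss(y, phi(y) = G0) in Eq. (13)
    loss_predict_1 = P[0] * C01          # loss(y, phi(y) = G1) in Eq. (13)
    return 0 if loss_predict_0 < loss_predict_1 else 1

print(predict_label({1: 0.3, 0: 0.7}))   # losses 3.0 vs 0.7 -> predict the minority class (1)
```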

3 Results

Experiment 1 We compare the performance of 5 methods (sparse representation based classification (SRC), CRC, SVM, ProCRC, and CSCRC) on 10 UCI data sets, and the results are summarized in Tables 1 and 2. The last row of Table 1 is the average accuracy of each method over the ten data sets. From the data sets Haberman, Housing, Ionosphere and Balance we randomly select 31 positive and 31 negative samples as test samples and 41 positive and 41 negative samples as training samples; from the other 6 data sets we select 61 positive and 61 negative samples as test samples and 101 positive and 101 negative samples as training samples. The cost ratio (the cost of a false acceptance with respect to a false rejection) is set to 10. We repeat this process 50 times and report the average results.
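
For reference, the Average Cost reported in Table 2 can be computed as the mean misclassification cost over the test set. The sketch below assumes the cost ratio of 10 described above (C10 = 10, C01 = 1) and uses hypothetical label arrays; label 1 denotes the minority (positive) class.

```python
# Sketch of the Average Cost metric under the assumed cost settings.
import numpy as np

def average_cost(y_true, y_pred, C10=10.0, C01=1.0):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fn = np.sum((y_true == 1) & (y_pred == 0))   # minority samples misclassified
    fp = np.sum((y_true == 0) & (y_pred == 1))   # majority samples misclassified
    return (fn * C10 + fp * C01) / len(y_true)

print(average_cost([1, 1, 0, 0], [1, 0, 0, 1]))  # (1*10 + 1*1) / 4 = 2.75
```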

Table 2 The average cost for the 5 methods on 10 data sets

On Letter, Balance, Abalone, Car, Nursery, Cmc and Haberman, our method achieves a very high accuracy compared with the other four methods. On the remaining three data sets our method does not always obtain the highest accuracy, but it achieves the highest average accuracy, and its accuracy values are higher than 0.93. In other words, our method performs better than SRC, CRC, SVM and ProCRC.

We calculate the misclassification cost of these 5 methods on the 10 UCI data sets and summarize the results in Table 2. On Letter, Balance, Abalone, Car, Pima, Nursery, Cmc and Haberman, our method achieves a very low average misclassification cost. In Table 1, SRC has the highest accuracy on Pima, but CSCRC obtains the best Average Cost on Pima, which indicates that CSCRC classifies the positive samples correctly. Furthermore, the accuracy of CSCRC is lower than that of CRC on Housing and Ionosphere, but the Average Cost shows the opposite trend.

Experiment 2 Similarly, we compare the performance of the 5 methods (SRC, CRC, SVM, ProCRC, CSCRC) on Letter, and evaluate their performance via G-mean and Average Cost for the class-imbalance problem. In this experiment, we vary the imbalance ratio over [1, 2, …, 10]. In the training set, the minority class contains 30 samples and the majority class contains 30 multiplied by the imbalance ratio. We select 61 positive samples and 61 negative samples as the test set. The costs are set as mentioned above.
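
The G-mean used in this experiment is the geometric mean of the true positive rate and the true negative rate. A minimal sketch with hypothetical labels (label 1 is the minority class) follows.

```python
# Sketch of the G-mean metric: sqrt(TPR * TNR).
import numpy as np

def g_mean(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)   # sensitivity on the minority class
    tnr = np.mean(y_pred[y_true == 0] == 0)   # specificity on the majority class
    return np.sqrt(tpr * tnr)

print(g_mean([1, 1, 0, 0, 0], [1, 0, 0, 0, 1]))  # sqrt(0.5 * (2/3)) ~= 0.577
```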

These results highlight the situations in which CSCRC is preferred. From the results in Figs. 1 and 2 we can see that CSCRC has a higher G-mean than the other four methods except when the imbalance ratio is 1. Meanwhile, CSCRC achieves the lowest Average Cost among all methods, which suggests that CSCRC can focus on the more informative (minority) samples. As the imbalance ratio increases, more training samples are available, and the proposed method still classifies the samples correctly when the imbalance ratio reaches 4. Generally speaking, class imbalance does not strongly affect the proposed CSCRC: it is not sensitive to the distribution of samples, and a good classification result can still be obtained when the imbalance ratio is high.

Fig. 1 The result of G-mean on Letter

Fig. 2 The result of Average Cost on Letter

4 Conclusions

Class-imbalanced data sets occur in many real-world applications where the class distributions are highly skewed. In this paper, we propose a novel method to handle misclassification costs and the class imbalance problem, called Cost-sensitive Collaborative Representation based Classification via Probability Estimation (CSCRC). The proposed approach adopts a probabilistic model and the collaborative representation coefficients to estimate the posterior probabilities, and then obtains the label of a testing sample by minimizing the misclassification losses. The experimental results show that the proposed CSCRC achieves a comparable or even lower average cost with higher accuracy compared with the other four classification algorithms.

In order to simplify the cost matrix, we restricted our discussion to two-class problems, so extending our current work to the multi-class scenario is a main direction for future work.