1 Introduction

The Support Vector Machine (SVM) constructs a hyperplane or a set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training pattern of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier [1]. Conventional SVM can be applied to many tasks, including text and hypertext categorization, image classification, protein classification in medical science, and handwritten character recognition. Although SVM has been validated to be effective in these applications, it still has two disadvantages: it cannot process matrix patterns, and it cannot process imbalanced problems.

\(\bullet \) Matrix patterns, whose dimensions are \(m\times n\) (where m and n are both larger than 1), are the basis of matrix learning; for example, videos and images are both matrix patterns. In order to process matrix patterns, matrix-pattern-oriented learning machines (MatC), i.e., matrix learning machines, have been developed. Classical matrix learning machines include the matrix-pattern-oriented Ho-Kashyap learning machine with regularization learning (MatMHKS) [2], the new least squares support vector classification based on matrix patterns (MatLSSVC) [3], and the one-class support vector machine based on matrix patterns (OCSVM) [4]. Besides these matrix learning machines, Xie et al. have proposed the Support Matrix Machine (SMM) [5] as a replacement for SVM. SMM can leverage the structure of the data matrices and has the grouping effect property. However, none of these matrix learning machines can process imbalanced problems.

\(\bullet \) As is well known, in many real-world classification problems, such as e-mail foldering [6], fault diagnosis [7], detection of oil spills [8], and medical diagnosis [9], a data set can be divided into two classes, a positive class and a negative class. When the size of the positive class is much smaller than that of the negative class, an imbalanced problem occurs. Since most standard classification learning machines, including the Support Vector Machine (SVM) and the Neural Network (NN), are proposed under the assumption of balanced class distributions or equal misclassification costs [10], they fail to properly represent the distributive characteristics of the patterns and yield unfavorable performance when adopted to process imbalanced problems. In order to overcome this disadvantage, the Fuzzy SVM (FSVM) [11] and the Bilateral-weighted FSVM (B-FSVM) [12] have been proposed. FSVM applies a fuzzy membership to each input pattern and reformulates SVM such that different input patterns can make different contributions to the learning of the decision surface. B-FSVM treats every pattern as belonging to both the positive and the negative class, but with different memberships, because one cannot say that a pattern belongs to one class absolutely. For both of them, however, how to determine the fuzzy membership function is the key point. Furthermore, neither of them can process matrix patterns.

In this paper, we propose a learning machine which can process both matrix patterns and imbalanced problems. First, in order to process matrix patterns, we adopt SMM as the base. Then, in order to process imbalanced problems, we adopt the notion of FSVM, namely, applying a fuzzy membership to each input pattern. Furthermore, we propose a new fuzzy membership evaluation approach which assigns the fuzzy membership of each pattern based on its class certainty. In this paper, class certainty denotes how certainly a pattern belongs to the class indicated by its label. Since entropy is an effective measure of certainty, we adopt entropy to evaluate the class certainty of each pattern. In doing so, an entropy-based fuzzy membership evaluation approach is obtained, which determines the fuzzy memberships of the training patterns from their corresponding entropies. By combining the entropy-based fuzzy membership evaluation with SMM, the Entropy-based Support Matrix Machine (ESMM) is proposed to process imbalanced data sets. In practice, the importance of the positive class is higher than that of the negative class in imbalanced data sets, i.e., the learning machine should pay more attention to positive patterns than to negative ones. Thus, positive patterns are assigned relatively large fuzzy memberships to guarantee the importance of the positive class. The fuzzy memberships of negative patterns are determined by the entropy-based fuzzy membership evaluation approach: patterns with lower class certainty are assigned small fuzzy memberships, based on the observation that such patterns are more sensitive to noise and easily mislead the decision surface, so their importance should be weakened on imbalanced data sets. After evaluating the fuzzy memberships of all training patterns, ESMM is adopted to classify imbalanced data sets.

The contributions of this paper can be highlighted as follows:

(1) A new entropy-based fuzzy membership evaluation approach is proposed. This approach adopts entropy to evaluate the class certainty of a pattern and determines the corresponding fuzzy membership based on that certainty. In doing so, the learning machine pays more attention to patterns with higher class certainty, which results in a more robust decision surface.

(2) To guarantee the importance of the positive class, positive patterns are assigned relatively large fuzzy memberships, which makes the decision surface pay more attention to the positive class and thus increases the generalization of the learning machine.

The rest of this paper is organized as follows. Section 2 introduces the proposed entropy-based fuzzy membership evaluation approach and then gives the details of ESMM. In Sect. 3, several experiments on real-world imbalanced data sets and matrix patterns including images are conducted to validate the effectiveness of ESMM. Finally, conclusions are given in Sect. 4.

2 Entropy-Based Support Matrix Machine (ESMM)

2.1 Entropy-Based Fuzzy Membership

When processing imbalanced data sets, the positive class is always more important than the negative class. Thus, in the proposed entropy-based fuzzy membership evaluation approach, we assign positive patterns a relatively high fuzzy membership, e.g., 1.0, to guarantee the importance of the positive class. As to the negative class, we fuzzify negative patterns based on their class certainties.

In information theory, entropy is often used to characterize the certainty of an information source: the smaller the entropy, the more certain the information [13]. By exploiting this property, we can evaluate the class certainty of each pattern in the training set. After obtaining the class certainty of each pattern, we assign its fuzzy membership based on that certainty. In practice, patterns with higher class certainty, i.e., lower entropy, are assigned higher fuzzy memberships to enhance their contributions to the decision surface, and vice versa.

Suppose that there are N training patterns \(\{x_i,y_i\}\), where \(i=1,2,\ldots ,N\) and \(y_i\in \{+1,-1\}\) is the class label. When \(y_i=+1\), pattern \(x_i\) belongs to the positive class; otherwise, it belongs to the negative class. Let the probabilities of \(x_i\) belonging to the positive and negative class be \(p_{+i}\) and \(p_{-i}\), respectively. The entropy of \(x_i\) is \(H_i=-p_{+i}\ln (p_{+i})-p_{-i}\ln (p_{-i})\), where \(\ln \) denotes the natural logarithm. Since the neighbors of a pattern capture its local information, the probability evaluation is based on its k nearest neighbors. For a pattern \(x_i\), we first select its k nearest neighbors \(\{x_{i1},x_{i2},\ldots ,x_{ik}\}\). Then we count the patterns of each class among these k neighbors and denote the numbers belonging to the positive and negative class by \(num_{+i}\) and \(num_{-i}\), respectively. Finally, the probabilities of \(x_i\) belonging to the positive and negative class are calculated as \(p_{+i}=\frac{num_{+i}}{k}\) and \(p_{-i}=\frac{num_{-i}}{k}\). After evaluating the class probabilities of \(x_i\), its entropy can be computed.
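To make this evaluation concrete, the following is a minimal sketch of the k-nearest-neighbor entropy estimate described above. The function name, the use of Euclidean distance, and the vectorized representation of the patterns are our own illustrative assumptions rather than details fixed by the paper.

```python
import numpy as np

def knn_entropy(X, y, k):
    """Estimate the entropy H_i of each training pattern from the class
    proportions among its k nearest neighbors (Euclidean distance assumed).

    X: (N, d) array of vectorized patterns; y: (N,) labels in {+1, -1}.
    Returns an (N,) array of entropies H_i = -p_{+i} ln p_{+i} - p_{-i} ln p_{-i}.
    """
    N = X.shape[0]
    H = np.zeros(N)
    for i in range(N):
        # Distances from x_i to all patterns; exclude x_i itself.
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf
        neighbors = np.argsort(dist)[:k]
        # p_{+i} = num_{+i} / k and p_{-i} = num_{-i} / k.
        p_pos = np.sum(y[neighbors] == 1) / k
        p_neg = 1.0 - p_pos
        # Accumulate -p ln p, with the convention 0 ln 0 = 0.
        for p in (p_pos, p_neg):
            if p > 0:
                H[i] -= p * np.log(p)
    return H
```

A pattern whose neighborhood is dominated by a single class thus receives an entropy near zero, i.e., high class certainty.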

By adopting the above entropy evaluation approach, the entropies of the negative patterns are \(H=\{H_{-1},H_{-2},\ldots ,H_{-n_{-}}\}\), where \(n_{-}\) is the number of negative patterns. Let \(H_{min}\) and \(H_{max}\) be the minimum and maximum entropy in H. The entropy-based fuzzy memberships for the negative patterns are evaluated as follows.

Firstly, separate the negative patterns into m subsets \(\{Sub_{k}\}\), \(k=1,2,\ldots ,m\), based on their entropies, as described in Table 1.

Table 1. Algorithm of negative class separation.

Then, fuzzy memberships of patterns in each subset are set as:

$$\begin{aligned} FM_{k}=1.0-\alpha \times (k-1),\ \ k=1,2,3,\ldots ,m \end{aligned}$$
(1)

where \(FM_{k}\) is the fuzzy membership for patterns distributed in subset \(Sub_k\), and the fuzzy membership parameter satisfies \(\alpha \in (0,\frac{1}{m-1})\) since \(FM_{k}\) must be positive and not larger than 1.0. It should be noted that patterns in the same subset are assigned the same fuzzy membership, so that they have the same importance to the decision surface. Finally, the fuzzy membership \(s_i\) of a training pattern \(x_i\) is assigned as follows: if \(y_i=+1\), then \(s_i=1.0\); if \(y_i=-1\) and \(x_i\in Sub_k\), then \(s_i=FM_{k}\). So far, the entropy-based fuzzy memberships of all training patterns have been evaluated.
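As an illustration, a minimal sketch of the membership assignment for the negative class is given below. Since Table 1 is not reproduced here, the sketch assumes that \([H_{min},H_{max}]\) is split into m equal-width intervals; this binning rule is our assumption, not necessarily the one used in the paper.

```python
import numpy as np

def entropy_fuzzy_membership(H_neg, m=10, alpha=0.05):
    """Assign FM_k = 1.0 - alpha * (k - 1) (Eq. (1)) to negative patterns.

    H_neg: entropies of the negative patterns. The separation into m
    subsets is assumed to use equal-width bins over [H_min, H_max];
    the paper defers the exact rule to Table 1.
    """
    H_neg = np.asarray(H_neg, dtype=float)
    H_min, H_max = H_neg.min(), H_neg.max()
    width = (H_max - H_min) / m if H_max > H_min else 1.0
    # Subset index k in {1, ..., m}: lower entropy (higher certainty)
    # falls into a smaller k and hence receives a larger membership.
    k = np.minimum(np.floor((H_neg - H_min) / width).astype(int) + 1, m)
    return 1.0 - alpha * (k - 1)
```

With the experimental setting m = 10 and \(\alpha =0.05\) of Sect. 3.1, the memberships produced by this sketch lie in [0.55, 1.0], consistent with the constraint \(s_i>0.5\) imposed there; positive patterns simply receive \(s_i=1.0\).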

2.2 Entropy-Based Support Matrix Machine

By adopting the entropy-based fuzzy membership evaluated above, we propose the entropy-based support matrix machine (ESMM). A detailed description of ESMM is given below.

Suppose that there is a binary classification problem with N matrix patterns \((A_i,y_i,s_i)\), \(i=1,2,\ldots ,N\). Here \(A_i\in R^{m\times n}\) is the matrix representation of \(x_i\) and its class label is \(y_i \in {\{+1,-1\}}\). If \(y_i=+1\), \(x_i\) (or \(A_i\)) belongs to class \(+1\), the positive class; if \(y_i=-1\), the pattern belongs to class \(-1\), the negative class. \(s_i\) is the entropy-based fuzzy membership.

The corresponding criterion function of ESMM is defined below.

$$\begin{aligned} \min \ L(W,b)=\frac{1}{2}tr(W^TW)+\theta \left| \left| W\right| \right| _{\star }+C\sum _{i=1}^{N}{s_i\left( 1-y_i\left[ tr(W^TA_i)+b_i\right] \right) } \end{aligned}$$
(2)

where \(W\in R^{m\times n}\) is the matrix of regression coefficients and \(\left| \left| W\right| \right| _{\star }\) is its nuclear norm, with \(\theta \) the corresponding coefficient; tr denotes the trace of a matrix; \(b_i\) is a slack variable for pattern \(A_i\); and C (\(C\in {R},\) \(C\ge 0\)) is the regularization parameter that adjusts the tradeoff between model complexity and training error. Here, \(b=[b_1,\ldots ,b_i,\ldots ,b_N]^T\) and each \(b_{i}\) is initialized such that \(b_{i}\ge 0\).
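For clarity, a minimal sketch of evaluating this criterion is given below; the function name is illustrative, and the actual optimization of W follows the SMM solver [5], which is not reproduced here.

```python
import numpy as np

def esmm_objective(W, b, A, y, s, theta, C):
    """Evaluate the ESMM criterion L(W, b) of Eq. (2).

    A: list of (m, n) pattern matrices; y, s, b: length-N arrays of
    labels, entropy-based fuzzy memberships, and slack variables.
    """
    reg = 0.5 * np.trace(W.T @ W) + theta * np.linalg.norm(W, ord='nuc')
    loss = sum(s[i] * (1.0 - y[i] * (np.trace(W.T @ A[i]) + b[i]))
               for i in range(len(A)))
    return reg + C * loss
```

The iteration for b is given in Eq. (3).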

$$\begin{aligned} b_{i}(k+1)=b_{i}(k)+\rho (e_{i}(k)+|e_{i}(k)|) \end{aligned}$$
(3)

where \(e_{i}(k)\), the error of pattern \(A_i\) at the k-th iteration, is \(e_{i}(k)=tr(W(k)^TA_i)-1-b_i(k)\), and W(k) and \(b_i(k)\) denote the values of W and \(b_i\) at the k-th iteration. We then adopt a method similar to that of SMM to obtain the optimal W and b.
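Under the same illustrative names, the b update of Eq. (3) can be sketched as follows; in the full method this step alternates with the SMM-style update of W, and max_iter corresponds to the iteration cap maxIter = 500 of Sect. 3.1.

```python
import numpy as np

def update_b(W, A, b, rho, max_iter=500):
    """Iterate b_i(k+1) = b_i(k) + rho * (e_i(k) + |e_i(k)|), Eq. (3),
    with e_i(k) = tr(W^T A_i) - 1 - b_i(k) for fixed W.

    Since e + |e| vanishes for nonpositive errors, b never decreases
    and stays nonnegative when started nonnegative.
    """
    b = np.asarray(b, dtype=float).copy()
    for _ in range(max_iter):
        e = np.array([np.trace(W.T @ Ai) - 1.0 - bi for Ai, bi in zip(A, b)])
        step = rho * (e + np.abs(e))  # zero wherever e_i(k) <= 0
        if not np.any(step):
            break
        b += step
    return b
```

After the optimal W and b are obtained, the discriminant function of ESMM is given below.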

$$\begin{aligned} g(A_i)=tr(W^TA_i) \end{aligned}$$
(4)

If \(g(A_i)>0\), we label \(A_i\) as a positive pattern; if \(g(A_i)<0\), we label \(A_i\) as a negative pattern.
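Putting the pieces together, prediction reduces to the sign of the trace inner product between W and the pattern matrix; the following sketch, again with illustrative names, summarizes Eq. (4).

```python
import numpy as np

def esmm_predict(W, A_list):
    """Label each matrix pattern by the sign of g(A) = tr(W^T A), Eq. (4)."""
    return np.array([1 if np.trace(W.T @ A) > 0 else -1 for A in A_list])
```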

3 Experiments

In this section, we adopt 25 real-world imbalanced data sets and 5 image data sets for experiments; the compared learning machines are SVM, FSVM, B-FSVM, MatMHKS, and SMM. The 25 real-world imbalanced data sets are selected from the KEEL imbalanced benchmark [14, 15]. Information on these data sets is given in Tables 2 and 3.

Table 2. Information of real-world imbalanced data sets (IR represents the imbalanced ratio of the corresponding data set).
Table 3. Information of image data sets. Imbalanced ratio of each image data set is 9.0.
Table 4. AUC values (\(\%\)) of the compared learning machines on real-world imbalanced data sets and image data sets. The best k of ESMM for each data set is presented. (Note that the average AUC values and the average ranks of the compared learning machines are listed in the last two rows. The best result for each data set is in bold.)

3.1 Experimental Settings

In terms of the compared learning machines and ESMM, the experimental settings are given here. For the SVM-based learning machines, the kernel used is the Radial Basis Function (RBF) kernel \(ker(x_i,x_j)=exp(-\frac{\left| \left| x_i-x_j\right| \right| _2^2}{\sigma ^2})\), where \(\sigma \) is selected from the set \(\{10^{-3}, 10^{-2},\ldots , 100, 1000\}\). For MatMHKS, the experimental setting can be found in [2]; for SMM, in [5]. The setting of ESMM is similar to that of SMM. Moreover, for ESMM, the number of separated subsets is \(m=10\) and the fuzzy membership parameter is \(\alpha =0.05\), which results in fuzzy memberships \(0.5\le s_i\le 1.0\). The reason for restricting \(s_i\in [0.5,1.0]\) is that the class label of \(x_i\) should not be neglected when determining the fuzzy membership, i.e., \(x_i\) is more likely to be classified to the class indicated by its class label. As to negative-class patterns, we set \(s_i>0.5\) to indicate that a negative pattern \(x_i\) is more likely to belong to the negative class. Moreover, the number of nearest neighbors k for calculating the class probability is selected from \(\{1,2,3,\ldots ,19,20\}\). In order to measure the performance of the compared learning machines on imbalanced data sets, the Area Under the ROC Curve (AUC) [25] is reported. Besides, \(maxIter=500\) denotes the maximal number of iterations. The one-against-one classification strategy is used for multi-class problems [21,22,23,24]. The 10-fold cross validation approach [26] is adopted for parameter selection. The computations are performed on an Intel Core 2 processor at 2.66 GHz with 8 GB RAM, under Microsoft Windows 7 and the Matlab environment.
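For reference, the RBF kernel exactly as written above (note the denominator \(\sigma ^2\) rather than the also-common \(2\sigma ^2\)) can be sketched as:

```python
import numpy as np

def rbf_kernel(x_i, x_j, sigma):
    """RBF kernel of Sect. 3.1: exp(-||x_i - x_j||_2^2 / sigma^2)."""
    return np.exp(-np.sum((x_i - x_j) ** 2) / sigma ** 2)
```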

3.2 Experiments on Real-World Imbalanced Data Sets

For the experiments, the AUC values are presented in Table 4. The average AUC over all used data sets is presented, and the average ranks of the learning machines on the used data sets are listed.

From the experimental results, it is found that: (1) ESMM achieves the best classification performance on 12 of the 30 imbalanced data sets, which indicates that our proposal outperforms the compared learning machines; (2) ESMM outperforms the traditional MatMHKS on 26 of the 30 imbalanced data sets; (3) the average AUC of ESMM is greater than those of B-FSVM, FSVM, and SVM by about 5\(\%\), 9\(\%\), and 10\(\%\), respectively, which demonstrates that ESMM has a significant advantage in processing imbalanced data sets compared to SVM, FSVM, and B-FSVM; (4) on some data sets, for example COIL-20, the three matrix-pattern-oriented approaches perform worse than SVM; we attribute this to chance, since the training part is drawn at random, and according to the average results our proposed ESMM still outperforms the others; (5) for the 25 vector data sets used, ESMM performs best on 9 of them, performs better on average on the others, and does not perform worst on any vector data set, which shows the superiority of ESMM on vector data sets; (6) as stated before, ESMM and FSVM both apply a fuzzy membership to each input pattern; the experiments show that ESMM performs better than FSVM, which validates that the proposed entropy-based fuzzy membership evaluation approach boosts the performance of a learning machine.

3.3 Influence of Parameter k on the Performance of ESMM

In ESMM, the entropy-based fuzzy membership is evaluated based on the class probability of each training pattern; thus, the number of nearest neighbors k may influence the class probability. To further investigate the effectiveness of ESMM, we study here the influence of k on the classification performance. All data sets given in Tables 2 and 3 are used, and the related experimental results are shown in Fig. 1. The figure shows the AUC on the testing sets of the adopted real-world imbalanced data sets and image data sets with respect to k. It is found that: (1) the number of selected nearest neighbors k has some influence on the classification performance, since the AUC curves fluctuate with respect to k on most data sets; (2) on some data sets the classification performance is sensitive to the variation of k, since the AUC curves fluctuate greatly there, while on other data sets it is not; (3) in general, with k from 8 to 12, ESMM achieves its best performance. This result provides guidance on how to determine an appropriate k in practical use.

Fig. 1. Variation of AUC values of ESMM with respect to k on the adopted imbalanced data sets and image data sets. In the legend, each number denotes one data set, as given in Tables 2 or 3.

3.4 Comparison Between ESMM and Entropy-Based MatMHKS (EMatMHKS)

Here, we compare ESMM with our previously proposed learning machine, the entropy-based MatMHKS (EMatMHKS) [27], which is also an entropy-based matrix learning machine. Table 5 shows the comparison between ESMM and EMatMHKS, where \(\uparrow \) (\(\downarrow \)) indicates that ESMM performs better (worse) than EMatMHKS, and \((\star )\) gives the amount \(\star \) by which the AUC value of ESMM is larger or smaller than that of EMatMHKS. From this table, we find that ESMM performs better than EMatMHKS on average.

Table 5. Comparison between ESMM and EMatMHKS on AUC values (\(\%\)).

4 Conclusion

There are two hot spots in present research: one is the imbalanced problem and the other is matrix learning. An imbalanced problem occurs when the size of the positive class is much smaller than that of the negative class. Most standard classification learning machines yield unfavorable performance on imbalanced data sets since they are originally designed for balanced problems. Although SVM can process imbalanced data sets to some extent, it assigns the same importance to each training pattern, which biases the decision surface toward the negative class. In order to overcome this disadvantage of SVM, researchers have proposed FSVM and B-FSVM, which apply fuzzy memberships to the training patterns to reflect their different importance. Since the key point in FSVM and B-FSVM is how to determine the fuzzy membership, this paper presents an entropy-based fuzzy membership evaluation approach for imbalanced data sets. Moreover, matrix patterns cannot be handled well by traditional learning machines including SVM, so scholars have developed MatMHKS and SMM. This paper combines the entropy-based fuzzy membership with SMM and proposes ESMM. ESMM can not only guarantee the importance of the positive class, but also pay more attention to the patterns with higher class certainties. Thus, ESMM results in more flexible decision surfaces than conventional SVM, FSVM, B-FSVM, and MatMHKS on imbalanced data sets. To validate the effectiveness of ESMM, we adopt 25 real-world imbalanced data sets and 5 image data sets for experiments. The experimental results demonstrate that ESMM outperforms the compared learning machines on both the real-world imbalanced data sets and the images. Moreover, in the process of ESMM, k has some influence on the classification performance.