
1 Introduction

Feature selection is used in many application areas relevant to expert and intelligent systems, such as machine learning, bioinformatics, and image processing. In computer vision, machine learning, and data mining, data are often represented by high-dimensional feature vectors. High dimensionality, however, increases the processing time and storage space required. Moreover, most existing machine learning methods, whether for classification, regression, or other tasks, are mainly designed for low-dimensional data, and computation on high-dimensional data usually becomes much more complex and difficult. As a solution to this issue, feature selection (also known as variable selection) [2, 13, 20, 21] is performed to choose a representative subset of the high-dimensional features. The subset is expected to carry sufficient information about the original high-dimensional feature set for specific learning tasks. The purpose of feature selection is to find the relevant feature subset that best reflects the statistical properties of the pattern categories and thus represents the original features.

According to the evaluation metric employed, feature selection algorithms can be roughly classified into three categories, i.e., filter, wrapper, and embedded methods [13]. Filter methods rely on general characteristics of the data, such as variance or the Fisher score [11, 14], to evaluate and select feature subsets without considering the learning algorithm. Wrapper methods use the performance of the learner that will ultimately be applied as the evaluation criterion. In embedded methods, feature selection and learner training are integrated into a single process, and features are selected automatically while the learning algorithm is trained.

Recently, some new embedded feature selection methods have been proposed that integrate the theory of sparse representation [3, 8, 24], compressed sensing, and feature selection [4, 18, 26]. Sparsity-inducing feature selection methods have been widely used in face authentication [5, 23], face detection, face attribute classification [9], and gene expression analysis [25]. However, computing an \({L}_{1}\)-regularized solution typically requires solving either an NP-hard problem or an alternative problem that still involves a costly iterative optimization [7].

In practical applications, features often have intrinsic structure, and integrating knowledge about this structure may help identify the important features. Ye and Liu [27] and Zhang et al. [28] proposed to use the group structure of the data to carry out feature selection. However, these methods share a defect: the number of groups is set by hand, and automatic grouping is not realized. On the one hand, the feature set F is artificially divided into k groups, a division that is usually guided by experience rather than by the data; on the other hand, this so-called grouping simply places m adjacent features into the same group, although adjacent features do not necessarily belong together.

Taking these factors into account, in this paper we propose a novel automatic grouping feature selection method, which uses the \({l}_{2}\) norm to ensure a group effect [31] and uses a mutual information measure to ensure low redundancy within each group. Because the objective function of the method uses a Laplacian regularization based on Mutual Information, we call it MIL. Our experiments demonstrate the efficiency of the proposed method. The difference between our approach and the traditional group lasso method is shown in Fig. 1. As shown in Fig. 1, the traditional group lasso groups contiguous features, the number of groups is decided by hand, and the grouping is fixed once it is determined. The MIL method forms groups autonomously by judging the correlation between features, so each group may consist of non-contiguous features and the groups may have unequal sizes.

Fig. 1.

Comparison between traditional group lasso and MIL methods. Suppose there is a data set \(A \in {R}^{7{\times }15}\); we use a 7\({\ \times \ }\)15 table to represent A, where each row represents a sample, each column represents a feature, and the same color indicates the same group. (a) Traditional group lasso grouping result; (b) MIL grouping result. (Color figure online)

The remainder of the paper is organized as follows. Section 2 presents related methods and the information theory used in our paper. Section 3 introduces the proposed method. The experimental evaluation is presented in Sect. 4. Finally, we conclude the paper in Sect. 5.

2 Related Work

In this section, we introduce some existing feature selection algorithms and then review the basics of information theory, which we will use in Sect. 3. Before presenting the details, we summarize the notation and definitions used in this paper. Let \(A={ \left[ { a }_{ 1 },{ a }_{ 2 },\ldots ,{ a }_{ n } \right] }^{ T }\in { R }^{ n{\times }d }\) be the data matrix whose rows are n samples in the d-dimensional space, where n is the number of samples and d is the number of features. Accordingly, denote the label vector \(y={ \left[ { y }_{ 1 },{ y }_{ 2 },\ldots ,{ y }_{ n } \right] }^{ T }\in { R }^{ n{\times }1 }\), where \({ y }_{ i }\in \left\{ +1,-1 \right\} ,i \in \left\{ 1,2,\ldots ,n \right\} \) for a binary classification task. Denote by \(\lambda \) the hyperparameter that balances the data misfit and the penalty, and by \(W={ \left[ w_{ 1 },{ w }_{ 2 },\ldots ,{ w }_{ d } \right] }^{ T }\in { R }^{d{\times }1 }\) the unknown weight coefficient vector that needs to be estimated.

2.1 Related Feature Selection Methods

As a kind of embedded method, regularization techniques based on the \({L}_{1}\) norm have been widely used for feature selection in machine learning tasks. According to compressive sensing theory, the minimum \({L}_{1}\) norm solution to an underdetermined system of linear equations is equivalent to the sparsest possible solution under general conditions. Destrero et al. [5, 6] used lasso for feature selection in face detection and face authentication. Its objective function is:

$$\begin{aligned} \min _{ W }{ { \left\| y-AW \right\| }_{ 2 }^{ 2 } } +\lambda { \left\| W \right\| }_{ 1 } \end{aligned}$$
(1)

Lasso is suboptimal in that it produces biased estimates for large coefficients. Zou [30] observed that Lasso applies the same degree of shrinkage to all coefficients and does not possess the oracle properties. To improve on this, the adaptive Lasso [30] was proposed:

$$\begin{aligned} \min _{ W }{ { \left\| y-AW \right\| }_{ 2 }^{ 2 } } +\lambda \sum _{ i=1 }^{ d }{ { a }_{ i } } \left| { w }_{ i } \right| \end{aligned}$$
(2)

From Eq. 2, we can see that the only difference between Lasso and adaptive Lasso is that the latter assigns a weight coefficient \({a}_{i}\) to each feature. Applying different amounts of shrinkage to different coefficients gives the adaptive Lasso the oracle properties, so it not only has good usability in practice but also has excellent theoretical properties. Obviously, when all weights \({a}_{i}\) are equal, the adaptive Lasso is equivalent to the Lasso.
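To illustrate how such penalties are used for feature selection in practice, the following is a minimal sketch (not the implementation used in this paper) that fits a Lasso model with scikit-learn and obtains an adaptive Lasso via the standard column-rescaling trick; the data, the regularization values, and the choice of initial estimator are illustrative assumptions, and scikit-learn's objective differs from Eq. (1) only by a constant scaling of the data-fit term.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: n samples, d features, sparse true weights (illustrative only).
rng = np.random.default_rng(0)
n, d = 100, 50
A = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]
y = A @ w_true + 0.1 * rng.standard_normal(n)

# Plain Lasso, cf. Eq. (1): min ||y - AW||^2 + lambda ||W||_1 (up to sklearn's 1/(2n) scaling).
lasso = Lasso(alpha=0.05).fit(A, y)

# Adaptive Lasso, cf. Eq. (2): per-feature weights a_i, here taken from an initial
# ridge-like fit (one common heuristic).  Solving the weighted problem is equivalent
# to running a plain Lasso on rescaled columns A_i / a_i and mapping the solution back.
w_init = np.linalg.solve(A.T @ A + 1e-2 * np.eye(d), A.T @ y)
a = 1.0 / (np.abs(w_init) + 1e-6)          # adaptive per-coefficient weights
A_scaled = A / a                            # divide each column i by a_i
adaptive = Lasso(alpha=0.05).fit(A_scaled, y)
w_adaptive = adaptive.coef_ / a             # map back to the original scale

print("Lasso nonzeros:   ", np.flatnonzero(lasso.coef_))
print("Adaptive nonzeros:", np.flatnonzero(np.abs(w_adaptive) > 1e-8))
```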

The fused Lasso introduced in [22] yields a solution that is sparse in both the coefficients and their successive differences. The objective function of fused Lasso can be represented as follows:

$$\begin{aligned} \min _{ W }{ { \left\| y-AW \right\| }_{ 2 }^{ 2 } } +{ \left\| W \right\| }_{ 1 }+\alpha \sum _{ i=2 }^{ d }{ \left| { w }_{ i }-{ w }_{ i-1 } \right| } \end{aligned}$$
(3)

The bridge estimator [17] is defined as follows:

$$\begin{aligned} \min _{ W }{ { \left\| y-AW \right\| }_{ 2 }^{ 2 } } +\lambda \sum _{i=1}^{d}\left| w_{i} \right| ^{\gamma } \end{aligned}$$
(4)

The bridge estimator has two important special cases. When \(\gamma \) = 2, it is the popular ridge estimator. When \(\gamma \) = 1, it is the Lasso.

In many practical applications, some features are strongly correlated. In this case, the lasso tends to select only one of the correlated features. To deal with strongly correlated features, Zou and Hastie [31] proposed the elastic net regularization:

$$\begin{aligned} \min _{ W }{ { \left\| y-AW \right\| }_{ 2 }^{ 2 } } +\alpha \sum _{ i=1 }^{ d }{ \left| { w }_{ i } \right| } +\left( 1-\alpha \right) \sum _{ i=1 }^{ d }{ { w }_{ i }^{ 2 } } \end{aligned}$$
(5)

where the last term is the squared \({L}_{2}\)-norm of W. Zou and Hastie [31] showed that \({L}_{2}\) regularization has a grouping effect whereas \({L}_{1}\) regularization does not, and they added this grouping effect to the lasso by using \({L}_{2}\) regularization to handle strongly correlated features. Furthermore, when \( \alpha \) equals 1 the elastic net reduces to the lasso, and when \(\alpha \) equals 0 it reduces to the ridge estimator [17].
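For completeness, here is a small hedged example of the elastic net using scikit-learn, whose `ElasticNet` combines the same two penalties up to the library's own scaling conventions (`l1_ratio` roughly plays the role of \(\alpha \) in Eq. (5)); the synthetic data below merely illustrates the grouping effect on two strongly correlated features.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(1)
n = 200
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)     # nearly identical to x1
x3 = rng.standard_normal(n)
A = np.column_stack([x1, x2, x3])
y = x1 + x2 + 0.1 * rng.standard_normal(n)

# Lasso tends to keep only one of the correlated pair (x1, x2) ...
print(Lasso(alpha=0.1).fit(A, y).coef_)
# ... while the elastic net spreads the weight over both, illustrating the grouping effect.
print(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(A, y).coef_)
```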

The penalties introduced in Eqs. (1)–(5) assume that features are independent and ignore the structure of the features completely [27]. However, in practical applications features have some essential structure, such as groups [27, 28]. Suppose that the features are divided into k groups. With this group structure, W is rewritten as k groups \(W=\left\{ { w }_{ G1 },{ w }_{ G2 },\ldots , { w }_{ Gk } \right\} \), and the objective function of group lasso is:

$$\begin{aligned} \min _{ W }{ { \left\| y-AW \right\| }_{ 2 }^{ 2 } } +\sum _{ i=1 }^{ k }{ { \beta }_{ i }{ \left\| { w }_{ Gi } \right\| }_{ q } } \end{aligned}$$
(6)

where \({ \left\| \cdot \right\| }_{ q }\) denotes the q-norm and \({ \beta }_{ i }\) is the weight coefficient of the i-th group. Different structured feature selection methods arise from different values of q or different constraints, such as sparse group lasso [29]. The group structure provides good access to the structural properties of the data. However, the number of groups is set by hand, so automatic grouping is not realized. Moreover, features that are adjacent to each other do not necessarily belong to the same group.
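To make the group penalty in (6) concrete, the following short sketch (an illustration, not code from the paper) evaluates the penalty for a manually fixed, contiguous grouping and applies the standard block soft-thresholding operator used by proximal-gradient solvers for group lasso; the weights and groups are made-up values.

```python
import numpy as np

def group_lasso_penalty(w, groups, betas):
    """Sum_i beta_i * ||w_Gi||_2 for a fixed list of index groups."""
    return sum(b * np.linalg.norm(w[g]) for g, b in zip(groups, betas))

def block_soft_threshold(w, groups, lam):
    """Proximal operator of the group-lasso penalty: shrink each group as a block."""
    out = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        out[g] = 0.0 if norm <= lam else (1.0 - lam / norm) * w[g]
    return out

w = np.array([0.9, 0.8, 0.05, 0.02, -1.2, 1.1])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]  # contiguous, man-made groups
print(group_lasso_penalty(w, groups, betas=[1.0, 1.0, 1.0]))
print(block_soft_threshold(w, groups, lam=0.1))  # the weak group (indices 2, 3) is zeroed out
```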

2.2 Information Theory

Given two variables U and V with marginal probability distributions \(p\left( u \right) \) and \(p\left( v \right) \) and joint probability distribution \(p\left( u,v \right) \), their mutual information \(I\left( u,v \right) \) is defined as:

$$\begin{aligned} I\left( u,v \right) =\sum _{ u,v }{ p\left( u,v \right) \log { \frac{ p\left( u,v \right) }{ p\left( u \right) p\left( v \right) } } } \end{aligned}$$
(7)

When the variables U and V are completely unrelated, i.e., independent of each other, the mutual information attains its minimum value of zero, meaning that there is no overlapping information between the two variables; conversely, the stronger their interdependence, the larger the mutual information.
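As a concrete illustration of Eq. (7), the following sketch estimates the mutual information of two variables from samples via a joint histogram; binning continuous features in this way is an assumption of the example, not a prescription from the paper.

```python
import numpy as np

def mutual_information(u, v, bins=10):
    """Estimate I(U, V) from samples via a joint histogram (natural log, i.e., nats)."""
    joint, _, _ = np.histogram2d(u, v, bins=bins)
    p_uv = joint / joint.sum()                  # joint distribution p(u, v)
    p_u = p_uv.sum(axis=1, keepdims=True)       # marginal p(u)
    p_v = p_uv.sum(axis=0, keepdims=True)       # marginal p(v)
    mask = p_uv > 0                             # zero cells contribute nothing; avoid log(0)
    return np.sum(p_uv[mask] * np.log(p_uv[mask] / (p_u @ p_v)[mask]))

rng = np.random.default_rng(2)
x = rng.standard_normal(1000)
print(mutual_information(x, x))                          # strongly dependent: large value
print(mutual_information(x, rng.standard_normal(1000)))  # independent: close to zero
```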

3 MIL

In Sect. 2, we analyzed several feature selection algorithms. On the one hand, computing an \({L}_{1}\)-regularized solution typically requires solving either an NP-hard problem or an alternative problem that still involves a costly iterative optimization [7]; on the other hand, traditional structure-based methods do not group features automatically. Therefore, we propose a novel feature selection model, MIL.

Given a set of training samples \(A\in { R }^{ n{\times }d }\) and the target labels \(y\in { R }^{ n{\times }1}\) of the corresponding samples, MIL uses the following criterion:

$$\begin{aligned} \min _{ W }{ { \left\| y-AW \right\| }_{ 2 }^{ 2 } } +\alpha { \left\| W \right\| }_{ 2 }^{ 2 }+\beta \sum _{ i }^{ }{ \sum _{ j }^{ }{ { MI }_{ ij } } } { \left( { w }_{ i }-{ w }_{ j } \right) }^{ 2 }. \end{aligned}$$
(8)

to find W, where \(\alpha \) and \(\beta \) are the hyperparameters of the objective function. The first term of (8) is the least-squares loss, which measures how well the features fit the target labels. The second term of (8) is the squared \({ l }_{ 2 }\) norm. In MIL, the group effect is obtained via this \({L}_{2}\) penalty, where the groups are determined by the correlation between features: if the correlation between two features is large they are treated as belonging to the same group, and if it is small they are not. The third term of (8) is a manifold regularization in which \({ MI }_{ ij }\) is the correlation between the i-th and j-th features. In this paper, \({ MI }_{ ij }\) is defined as the mutual information of the i-th and j-th features (mutual information measures the degree of interdependence between two variables and is not limited to linear correlation; it also applies to nonlinear correlation). It is reasonable to require \({ w }_{ i }\) and \({ w }_{ j }\) to be close to each other if the similarity (redundancy) between the i-th and j-th features is low, which is the objective of the third term of (8). In MIL, we use this term to ensure that the redundancy within each group is minimal; in fact, the third term of (8) plays a role similar to the \({ l }_{ 1 }\) penalty in feature selection, as it helps ensure minimal redundancy of the selected feature subset. Denote by MI the similarity matrix constructed from all \({ MI }_{ ij }\), and by D the diagonal matrix whose i-th diagonal element is the sum of the i-th row of MI. We then obtain the Laplacian matrix \(L=D-MI\) [15], and by simple algebra the third term of (8) can be represented as \(\beta { W }^{ T }LW\). The objective function can be rewritten as follows:

$$\begin{aligned} \min _{ W }{ { \left\| y-AW \right\| }_{ 2 }^{ 2 } } +\alpha { \left\| W \right\| }_{ 2 }^{ 2 }+\beta { W }^{ T }LW. \end{aligned}$$
(9)
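The rewriting of the third term of (8) as \({ W }^{ T }LW\) follows from the standard graph-Laplacian identity sketched below (the constant factor 2 is absorbed into the hyperparameter \(\beta \); MI is symmetric and \({ D }_{ ii }=\sum _{ j }{ { MI }_{ ij } }\)):

$$\begin{aligned} \sum _{ i }{ \sum _{ j }{ { MI }_{ ij }{ \left( { w }_{ i }-{ w }_{ j } \right) }^{ 2 } } } =\sum _{ i }{ \sum _{ j }{ { MI }_{ ij }\left( { w }_{ i }^{ 2 }-2{ w }_{ i }{ w }_{ j }+{ w }_{ j }^{ 2 } \right) } } =2\sum _{ i }{ { D }_{ ii }{ w }_{ i }^{ 2 } } -2\sum _{ i }{ \sum _{ j }{ { MI }_{ ij }{ w }_{ i }{ w }_{ j } } } =2{ W }^{ T }\left( D-MI \right) W=2{ W }^{ T }LW. \end{aligned}$$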

Fortunately, (9) is convex, and setting its gradient with respect to W to zero yields the analytical solution:

$$\begin{aligned} W={ \left( { A }^{ T }A+\alpha I+\beta L \right) }^{ -1 }{ A }^{ T }y. \end{aligned}$$
(10)

After obtaining W, we can rank the features according to \(|{ w }_{ i }|\): the larger \(|{ w }_{ i }|\) is, the more important the corresponding feature is [12]. We can either select a fixed number of the most important features or set a threshold and select the features whose \(|{ w }_{ i }|\) exceeds it [16].
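Putting the pieces together, the following is a minimal sketch of MIL as described above; the histogram-based mutual-information estimator, the synthetic data, and the parameter values are assumptions for illustration and are not those used in the experiments.

```python
import numpy as np

def mi_hist(u, v, bins=10):
    """Histogram estimate of mutual information (as in the Sect. 2.2 sketch)."""
    p, _, _ = np.histogram2d(u, v, bins=bins)
    p = p / p.sum()
    pu, pv = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / (pu @ pv)[m]))

def mil_feature_scores(A, y, alpha=1.0, beta=1.0, bins=10):
    """Closed-form MIL solution of Eq. (10): W = (A^T A + alpha I + beta L)^{-1} A^T y."""
    n, d = A.shape
    MI = np.zeros((d, d))
    for i in range(d):                                   # pairwise feature mutual information
        for j in range(i + 1, d):
            MI[i, j] = MI[j, i] = mi_hist(A[:, i], A[:, j], bins=bins)
    L = np.diag(MI.sum(axis=1)) - MI                     # Laplacian L = D - MI
    W = np.linalg.solve(A.T @ A + alpha * np.eye(d) + beta * L, A.T @ y)
    return np.abs(W)                                     # importance score per feature

# Illustrative usage on synthetic data (parameters are not those of the paper).
rng = np.random.default_rng(3)
A = rng.standard_normal((80, 30))
y = A[:, 0] - 2.0 * A[:, 1] + 0.1 * rng.standard_normal(80)
top10 = np.argsort(mil_feature_scores(A, y))[::-1][:10]  # ten most important features
print(top10)
```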

4 Experimental Evaluation

In this section, we present the experimental results of our method. The performance of MIL, the method newly proposed in this paper, is compared with the following methods: Lasso (\({L}_{1}\) regularization) [5], Ridge (\({L}_{2}\) regularization) [17], elastic-net [31], and group lasso [28]. These methods are chosen for the following reasons: (a) like MIL, they are based on regularization; (b) they are reported in the literature to provide good performance; (c) they include both structured and unstructured methods, with and without a grouping effect, which helps in comparing the performance of the algorithms. In the experiments, we use both non-image and image data sets.

The non-image data sets are Colon [1] and Leukemia [10], where Colon contains 2000-dimensional raw features and Leukemia contains 7029-dimensional raw features. For the image data set, we select the CFW [19] face data set, a large collection of celebrity face images from the Internet containing 200,000 face images of 1,500 celebrities. We selected 8,000 face images (20 images \(\ \times \ \) 400 people) from CFW 60K to carry out attribute classification experiments; the 14 face attributes in the selected images include gender, race, age, and so on (see Table 1 for details). We then use the uniform LBP (ULBP) descriptor to extract low-level features. To be more specific, we scale the face images to 140\(\ \times \ \)160 pixels, divide each image into 7\(\ \times \ \)8 cells of 20\(\ \times \ \)20 pixels each, and use the ULBP descriptor to extract 3,304-dimensional raw features.
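A hedged sketch of the cell-based ULBP extraction described above, using scikit-image's uniform LBP: the bin count (59 non-rotation-invariant uniform patterns for 8 neighbours, so 56 cells \(\times \) 59 bins = 3,304 dimensions), the neighbourhood parameters, and the image path are illustrative assumptions rather than the exact pipeline used in the paper.

```python
import numpy as np
from skimage import io, transform
from skimage.feature import local_binary_pattern

def ulbp_features(path, size=(160, 140), grid=(8, 7), bins=59):
    """Uniform-LBP histogram per 20x20 cell, concatenated into one raw feature vector."""
    img = io.imread(path, as_gray=True)
    img = transform.resize(img, size)                    # 140 x 160 face image (cols x rows)
    lbp = local_binary_pattern(img, P=8, R=1, method='nri_uniform')  # pattern codes in [0, 58]
    rows, cols = size
    ch, cw = rows // grid[0], cols // grid[1]            # 20 x 20 pixel cells
    feats = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            cell = lbp[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            hist, _ = np.histogram(cell, bins=bins, range=(0, bins))
            feats.append(hist / hist.sum())              # normalized 59-bin histogram
    return np.concatenate(feats)                         # 56 cells * 59 bins = 3304 dims

# Hypothetical usage; 'face_0001.jpg' is a placeholder path, not a CFW file name.
# x = ulbp_features('face_0001.jpg')
# print(x.shape)   # (3304,)
```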

Fig. 2.

Classification accuracy rate achieved with the Colon data set.

Fig. 3.

Classification accuracy rate achieved with the Leukemia data set.

4.1 Performance Analysis with Colon and Leukemia Dataset

In the first experiment, we use the non-image data sets to evaluate our approach. We select only up to 50 features in order to observe the differences between the methods. Figures 2 and 3 show the classification accuracy on the two data sets (Colon, Leukemia). As shown in Fig. 2, which illustrates the experiment on Colon, MIL reaches stable accuracy with just 12 features and, from the 4th feature onward, stays clearly ahead of the other methods; none of the other methods achieves comparable classification results. Figure 3 shows similar results on the Leukemia data set.

4.2 Performance Analysis on Face Dataset

In this experiment, we test MIL on the image data set. In the CFW data set, we chose 14 face attributes (see Table 1 for details) and used the MIL method to classify them. To observe how accuracy varies with the number of selected dimensions, the recognition rate is computed for 10, 20, \(\ldots \), 100 features and for 100, 200, \(\ldots \), 700 features. The results are shown in Table 1.

Table 1 shows that our method performs well. Averaging the recognition rate over all dimensions reflects the relationship between recognition rate and dimensionality: in general, a higher average recognition rate means that fewer dimensions are needed to reach a given recognition rate. The classification results obtained by group lasso are also not bad. Most image feature descriptors are statistical in nature, and ULBP is such a case: it aggregates information over a region, which in this paper is a 20\(\ \times \ \)20-pixel cell. Consequently, selecting contiguous features as a group does not have much impact on the outcome of this experiment, and the flaw of algorithms that cannot group features automatically is less apparent.

Table 1. The average recognition accuracy rate \((\%)\) of five different methods; the highest values are in bold. The average is calculated as the mean accuracy rate across a range of feature set sizes (from 10 to the maximum number, 700, of selected features).

4.3 Efficiency

Through the above analysis, we can say that our method is effective for feature selection. In this experiment, we test the efficiency of MIL compared to the other approaches. The time complexity of MIL is \(O(kN^{2})\), where N is the number of features, so MIL is generally more computationally demanding. In this paper, the raw features of the face data set have 3,304 dimensions. We use the time required to compute the weight coefficient vector W as the measure. For the methods that involve \({L}_{1}\) regularization, we set the error tolerance to \({ 10 }^{ -6 }\). Taking the face gender attribute as an example, the result is shown in Fig. 4.

Fig. 4.

Run time (s) comparison between MIL and other feature selection algorithms, measured by the time for computing W weight coefficient vector from gender attribute of the CFW data set.

As shown in Fig. 4, the proposed method takes 83.06 s to compute W on the CFW data set with 3,304-dimensional features, so MIL is considerably more expensive. Nevertheless, we believe that the time complexity should not be a major deterrent to the practicality of MIL. In many applications, the data collection time far exceeds the time required for data mining tasks such as feature selection (e.g., days to months for data collection vs. hours for data mining). Commodity multi-core systems are common nowadays, and it is straightforward to parallelize MIL to harness this parallel processing power. In such cases, it is justifiable to spend a significant amount of time on data processing, and the improved performance brought about by MIL is worth the effort. Towards this end, we tested a parallel version of MIL. The effectiveness of parallelization can be clearly observed in Fig. 5.
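As an indication of how the pairwise mutual-information computation, the dominant cost of MIL, can be distributed across cores, here is a small sketch using joblib; the row-wise chunking, the core count, and the histogram estimator are illustrative assumptions, not the exact parallel implementation evaluated below.

```python
import numpy as np
from joblib import Parallel, delayed

def mi_hist(u, v, bins=10):
    """Histogram estimate of mutual information (as in Sect. 2.2)."""
    p, _, _ = np.histogram2d(u, v, bins=bins)
    p = p / p.sum()
    pu, pv = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    m = p > 0
    return np.sum(p[m] * np.log(p[m] / (pu @ pv)[m]))

def mi_row(A, i, bins=10):
    """Mutual information between feature i and every later feature."""
    return [mi_hist(A[:, i], A[:, j], bins=bins) for j in range(i + 1, A.shape[1])]

def mi_matrix_parallel(A, n_jobs=8, bins=10):
    """Fill the symmetric MI similarity matrix, distributing rows across cores."""
    d = A.shape[1]
    rows = Parallel(n_jobs=n_jobs)(delayed(mi_row)(A, i, bins) for i in range(d))
    MI = np.zeros((d, d))
    for i, vals in enumerate(rows):
        MI[i, i + 1:] = vals
    return MI + MI.T

# Illustrative usage on random data (not the CFW features).
A = np.random.default_rng(4).standard_normal((100, 300))
MI = mi_matrix_parallel(A, n_jobs=8)
print(MI.shape)
```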

Fig. 5.

Run time (s) comparison between serial and parallel MIL, measured by the time for computing W weight coefficient vector from gender attribute of the CFW data set.

As seen in Fig. 5, the parallel version of MIL effectively reduces the time needed to compute the weight vector W. With an 8-core processor, the computational efficiency of MIL is greatly improved, with a computation time of 27.02 s; however, as the number of cores increases further, the additional improvement becomes less obvious. With parallel computing, the time MIL spends computing W for training samples of the same dimensionality is shortened to about half that of elastic-net and group lasso.

5 Conclusions

In this paper, we proposed a novel feature selection method that realizes automatic grouping. The proposed method uses mutual information to achieve minimal redundancy within each group. Because the cost of computing mutual information grows with the number of features, we use parallel computing to reduce it. It is worth mentioning that the iterative computation required by \({L}_{1}\) regularization cannot be accelerated by parallel computing in the same way. We compared MIL with other feature selection methods to evaluate its effectiveness. Experimental results show that MIL obtains a high recognition rate with fewer feature dimensions and outperforms the other methods on several public data sets.