
1 Introduction

Nowadays, machine learning techniques are used in ever more fields, such as medicine in the broad sense, neuroimaging, image classification and the detection of network attacks. These fields produce huge amounts of data with many attributes. Paradoxically, such a large amount of information does not improve the quality of the algorithms, and the data itself is expensive to acquire and store. This has resulted in the need for methods that reduce the size of the data without degrading (or even while improving) the quality of classifiers. The reason why more information does not mean better classification is the so-called curse of dimensionality, described for the first time by Richard Bellman [1]. As dimensions are added to a dataset, the distances between individual points keep increasing, and so does the number of objects needed for proper generalization. It is estimated that for linear classifiers this number grows linearly with dimensionality, and quadratically for quadratic algorithms. The situation is even worse for non-parametric classifiers, such as neural networks or those using radial basis functions, where the number of objects needed for proper generalization grows exponentially [2]. The curse of dimensionality is sometimes referred to as the "small n, large p" problem [4].

The curse of dimensionality results in the Hughes phenomenon [3]: for a fixed number of samples, recognition accuracy may first increase as the number of attributes grows, but it decreases once the number of attributes exceeds a certain optimal value. Besides the growing distances between samples, this is also caused by noise in the data and insignificant features. Feature selection and feature extraction (reduction) are used to reduce the dimensionality of the data. Feature selection chooses a subset of the features used for classification, while feature extraction transforms (e.g., linearly) the feature space.

2 Methods

Principal Component Analysis belongs to the projection methods. The goal of projection methods is to find a mapping from the original space with d dimensions to a new space with k dimensions (\(k \le d\)) that minimizes the loss of information [5].

It is an unsupervised learning method, which means it does not need class labels. In the case of pca, the new attributes are created in a way that maximises their variance. The algorithm aims to create new features (the so-called principal components) that are uncorrelated (orthogonal) and ordered by decreasing variance. For the algorithm to give correct results, the input data should be normalized first. The principal components are the eigenvectors of the covariance matrix of the input attributes. Because only their direction matters, their lengths are set to 1. Assuming that \(\lambda _{i}\) is the eigenvalue of the \(i^{th}\) eigenvector, after ordering the eigenvalues in descending order, the proportion of the total variance explained by the first k vectors can be calculated using the formula:

$$\begin{aligned} \frac{\lambda _{1} + \lambda _{2} + \ldots + \lambda _{k}}{\lambda _{1} + \lambda _{2} + \ldots + \lambda _{k} + \ldots + \lambda _{n}} \end{aligned}$$
(1)

If the original dimensions of the input data are strongly correlated with each other, we obtain a small number of eigenvectors with large eigenvalues, and a large reduction of dimensionality is possible. However, if the dimensions are not strongly correlated, k will be similar to n and it is not possible to reduce the dimensionality without losing a significant part of the variance of the set [5]. If the number of attributes exceeds the number of objects, the dimensionality can be reduced to at most the number of samples [6].
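As an illustration, below is a minimal sketch (in Python, using only numpy) of the explained-variance ratio of Eq. (1): the eigenvalues of the covariance matrix of the standardized data are sorted in descending order and the share of the first k is computed. The data sizes used here are arbitrary and purely illustrative.

```python
# Minimal sketch of Eq. (1): proportion of the total variance explained by the
# first k principal components, computed from the covariance eigenvalues.
import numpy as np

def explained_variance_ratio(X, k):
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the input attributes
    cov = np.cov(X, rowvar=False)              # covariance matrix of the attributes
    eigvals = np.linalg.eigvalsh(cov)[::-1]    # eigenvalues in descending order
    return eigvals[:k].sum() / eigvals.sum()   # Eq. (1)

# Illustrative data: 230 samples, 18 attributes (arbitrary sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(230, 18))
print(explained_variance_ratio(X, 3))
```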

One of the disadvantages of pca is that it uses a linear transformation, which makes it unsuitable for more complex spaces. A solution to this problem is to extend the basic algorithm with the so-called kernel trick, obtaining kpca (Kernel Principal Component Analysis).

In order to solve a non-linear problem, one would first have to transform the input space X into some high-dimensional space F using a function \(\phi (x)\), and then e.g. calculate the scalar product \(\langle \phi (x), \phi (x')\rangle \). However, this would be computationally expensive. Therefore, a kernel function \(k(x, x') = \langle \phi (x), \phi (x')\rangle \) is chosen for some transformation \(\phi \) [7]. One of the models using this trick is e.g. the svm classifier.
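As a brief illustration of the kernel trick, the sketch below uses scikit-learn's KernelPCA, which works on the kernel values \(k(x, x')\) without ever computing \(\phi (x)\) explicitly. The RBF kernel, the number of components and the data are illustrative assumptions only.

```python
# Minimal kpca sketch: project standardized data onto 3 non-linear components
# using an RBF kernel; phi(x) is never computed explicitly, only k(x, x').
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                   # illustrative data only

X_std = StandardScaler().fit_transform(X)
kpca = KernelPCA(n_components=3, kernel="rbf", gamma=0.1)
X_reduced = kpca.fit_transform(X_std)
print(X_reduced.shape)                           # (100, 3)
```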

Another idea for extending pca is, for example, using class labels as in the Karhunen-Loève expansion, or carrying out feature selection in the space obtained by pca [8]. In addition to the standard pca, new versions are often created to suit specific problems. One such variation of the pca method is SuperPCA [12]. It is used in a classification problem related to hyperspectral imaging [17]. The method combines pca with a segmentation algorithm based on superpixels.

Another interesting extension of pca is dipca (Dynamic Inner PCA), a method also used in process monitoring, but focusing on the dynamics of the data [13]. Its goal is to maximize the covariance between components and their earlier values. It accomplishes this by extracting a model of dynamic latent variables, on which standard pca is then performed.

When it comes to supervised methods, lda is also still widely used. An example of the use of linear discriminant analysis is the already mentioned feature extraction for the task of cancer recognition based on microscopic tissue images [11]. A team from India used a different approach to diagnose lung cancer [14], with computed tomography images as input. In that study, lda was used to reduce the dimensionality of the data before classification with an Optimal Deep Neural Network. The results showed an improvement in quality compared to previously used classifiers.

Another proposed method is ccpca, a pca analysis modified by factor rotation. The authors of [15] proposed a factor rotation with respect to the centroids of the decision classes. The method was used to assess the risk of lymphocytic leukaemia.

This article presents gpca, a new concept for building the principal components of the pca method. For this purpose, a stochastic gradient optimization method is used [16].

In the case of gpca, in place of the eigenvalues and eigenvectors we look for a matrix K such that:

$$\begin{aligned} K_{i,j} = L(Z_{i}, Z_{j}), \end{aligned}$$
(2)

where L is the objective function, Z is a standardized variable, and K is, e.g., the kernel:

$$\begin{aligned} L(Z_i, Z_j) = \sum _{i=1}^{n}{\left( Z_i-\omega ^{T}Z_j \right) ^{2}}, \end{aligned}$$
(3)

where \(L(Z_i, Z_j)\) is the overall error on the training set and \(\omega \) is the weight vector optimized with the gradient method.

The minimization of \(L(Z_i, Z_j)\) starts from the selected initial solution \(\omega _{0}=0\). Then the gradient \(\nabla _{L}\left( \omega _{k-1}\right) \) is determined at the point \(\omega _{k-1}\), and successive steps along the negative gradient are taken:

$$\begin{aligned} \omega _{k}=\omega _{k-1}-\alpha _{k}\nabla _{L}\left( \omega _{k-1}\right) , \end{aligned}$$
(4)

where \(\alpha _{k}\) is the step length determined by a line search. The gradient \(\nabla _{L}\) is calculated using the derivative:

$$\begin{aligned} \frac{\partial \left( Z_i-\omega ^{T}Z_j \right) ^{2}}{\partial \omega _{j}}=-2\left( Z_i-\omega ^{T}Z_j \right) Z_j \end{aligned}$$
(5)

Finally

$$\begin{aligned} \nabla _{L}\left( \omega \right) =-2\left( Z_i-\omega ^{T}Z_j \right) Z_j. \end{aligned}$$
(6)

Each principal component can now be represented as a linear combination of the original variables Z:

$$\begin{aligned} G_{i}=\sum _{j=1}^{m}{a_{ij}Z_j}, \quad i = 1, \ldots , k, \end{aligned}$$
(7)

where m is the number of original variables in the training set, k is the number of principal components, \(Z_j\) is the j-th standardized variable, \(G_{i}\) is the i-th principal component, and \(a_{ij}\) are the factor loadings.

The developed gpca method can be used in non-linear feature spaces; other kernel functions may be proposed depending on the class of the problem. In this article we consider the linear case.
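For clarity, the sketch below illustrates the gradient updates of Eqs. (3)-(6) in the linear case: stochastic gradient descent on the squared-error objective, starting from \(\omega _{0}=0\). A fixed step length \(\alpha \) is assumed instead of a line search, and the construction of the K matrix and of the loadings in Eq. (7) is not reproduced; this is only an illustration of the optimization step, not the full gpca procedure of [16].

```python
# Sketch of the stochastic gradient step used in gpca (Eqs. (3)-(6)), assuming
# a fixed step length alpha instead of a line search. Z holds the standardized
# predictors, z the standardized target variable of the regression objective.
import numpy as np

def sgd_direction(Z, z, alpha=0.01, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    omega = np.zeros(Z.shape[1])              # start-up solution omega_0 = 0
    for _ in range(epochs):
        for i in rng.permutation(len(z)):     # one sample at a time (stochastic)
            residual = z[i] - omega @ Z[i]    # (Z_i - omega^T Z_j)
            grad = -2.0 * residual * Z[i]     # gradient, Eq. (6)
            omega = omega - alpha * grad      # update step, Eq. (4)
    return omega

# Illustrative use on random standardized data
rng = np.random.default_rng(1)
Z = rng.normal(size=(230, 6))
z = Z @ rng.normal(size=6) + 0.1 * rng.normal(size=230)
print(sgd_direction(Z, z))
```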

3 Experimental Set-Up

The aim of the research is to build a feature extraction method that will allow a more accurate classification of children with multiple sclerosis. The problem is important because prognosis of the development of the disease is an extremely difficult process. Often, only appropriately selected variables allow children to be accurately assigned to certain risk groups. The developed method offers a chance to build a tool that will support the physician in diagnostics and thus can contribute to the correct diagnosis and treatment of children. Because multiple sclerosis does not give clear-cut initial symptoms, well-chosen variables and risk groups can improve the quality of classification. This goal has become the most important reason for undertaking research on the construction of the extraction model, which will form the basis for classification using known algorithms. Similar studies have already been conducted, and the previously developed ccpca method [15] has found a real application in the classification of people with lymphocytic leukaemia. Particular attention was paid to the newly developed gpca concept, which focuses on the optimization of the factor rotation axes using the gradient method.

A real-world dataset was used in our research. The data relate to the prognosis of multiple sclerosis in children and contain two classes: poor prognosis and good prognosis. The numbers of respondents in the two classes are comparable, so the data are balanced.

In the experiments, several feature extraction methods known from the literature were compared: pca (Principal Component Analysis) [5], kpca (Kernel Principal Component Analysis) [7], ccpca (Centroid Class Principal Component Analysis) [15], fa (Factor Analysis) [9], ica (Independent Component Analysis) [10], and gpca (Gradient Component Analysis), the new method proposed in this article.

Two experiments were performed, in which the accuracy score of three classifiers was verified in turn: svm (Support Vector Machine), rf (Random Forest) and k-nn (k-Nearest Neighbours).

The accuracy score metric was used to assess the quality of the classification. The Wilcoxon signed-rank test at a statistical significance level of \(\alpha =0.05\) was used to assess the differences in accuracy between the methods and algorithms. Five-fold stratified cross-validation was used in all experiments.
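A minimal sketch of this evaluation protocol is shown below: stratified 5-fold cross-validation, accuracy scores, and a Wilcoxon signed-rank test comparing a pipeline with feature extraction against one without. Plain pca and svm are used as stand-ins, and the synthetic data are purely illustrative.

```python
# Sketch of the evaluation protocol: 5-fold stratified CV, accuracy score,
# and a Wilcoxon signed-rank test (alpha = 0.05) on the paired fold scores.
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=18, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
with_pca = make_pipeline(StandardScaler(), PCA(n_components=0.7), SVC())
no_extraction = make_pipeline(StandardScaler(), SVC())

acc_pca = cross_val_score(with_pca, X, y, cv=cv, scoring="accuracy")
acc_raw = cross_val_score(no_extraction, X, y, cv=cv, scoring="accuracy")

stat, p = wilcoxon(acc_pca, acc_raw)          # paired test over the folds
print(acc_pca.mean(), acc_raw.mean(), p)      # difference significant if p < 0.05
```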

4 Experimental Evaluation

The conducted research was divided into two experiments, with the results of the first experiment feeding into the second. In the first experiment, the number of principal components that explains a set threshold of the total variance was determined experimentally for the pca, ccpca and gpca methods. Thanks to this approach, we control the selection of principal components, and thus the number of features that form the basis of the classification. The thresholds for which the best classifications were obtained were carried over to the second experiment.

4.1 Experiment 1 - Determining the Quality of the Classification Depending on the Threshold of Total Explained Variance

Experiment 1 was carried out for the three methods pca, ccpca and gpca. Several thresholds of the explained total variance were adopted. The study was conducted on three algorithms: svm, rf and k-nn. The results are presented in Figs. 1 and 2.

Fig. 1. The dependence of classification accuracy on the applied thresholds of total explained variance for the PCA, CCPCA and GPCA feature extraction methods, on 230 training samples.

Fig. 2. The relationship between the assignment of object features to each of the three principal components and the factor loading values.

The results of Experiment 1 show that for each of the pca, ccpca and gpca methods there is a threshold of total variance at which the quality of all classifiers is the highest. As can be seen, these thresholds are consistent, and the best rates of correct classification for each pca method and classification algorithm are within 68–72%. It should be noted that for a threshold of 1 all features are taken for classification. For a threshold of 0.01, there is only one principal component, combining one to three attributes. For the 0.7 threshold, there are 3 principal components. Also note that there is a slight drift in the results for different but close thresholds. However, the matching of attributes to the principal components keeps improving, which leads to an interesting conclusion: as the threshold of total variance increases, the quality of matching attributes to the principal components increases. Figure 2 shows which features were assigned to a given principal component. The basis for assigning features to principal components was a factor loading value \(\lambda > 0.6\). The results indicate that one decision class of the problem is better fitted by component 1, while the other class is better classified by the sets of features in components 2 and 3. Based on the gpca method, the features Z7, Z8, Z10, Z12, Z14 and Z18 were rejected, as they do not make a significant contribution to explaining the object classes.
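As an illustration of this component and feature selection step, the following sketch picks the number of components for a 0.7 explained-variance threshold and then lists, per component, the features whose absolute loading exceeds 0.6. Plain pca is used as a stand-in for ccpca/gpca, and the data are random and purely illustrative.

```python
# Sketch of the selection step described above: choose components for a 0.7
# explained-variance threshold, then assign features with |loading| > 0.6.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(230, 18))                       # illustrative data only

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.7).fit(X_std)               # threshold of total variance
# Loadings: eigenvectors scaled by the square roots of the eigenvalues
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

for comp in range(loadings.shape[1]):
    selected = np.where(np.abs(loadings[:, comp]) > 0.6)[0] + 1   # 1-based indices
    print(f"component {comp + 1}: features {selected.tolist()}")
```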

4.2 Experiment 2. Determining the Quality of Classification for Various Methods of Feature Extraction

The purpose of the experiment is to verify how the proposed ccpca and gpca algorithms perform in the feature extraction task against the other methods, i.e. pca, kpca, fa and ica. The goal was achieved by checking the quality of classification of the real data using three algorithms: svm, rf and k-nn. Based on the results obtained in Experiment 1, a threshold of 70% of the total explained variance was selected for the pca, ccpca and gpca methods on the training data set. The obtained accuracy scores and the results of the Wilcoxon signed-rank test are shown in Table 1. The first measuring points, labelled with the names of the algorithms, relate to the case without any feature extraction. The following results, i.e. pca, ccpca, gpca, kpca, fa and ica, relate to the classification for a given algorithm after extraction of the features by the given method.

Table 1. The results of the experiments for the binary case using the accuracy score metric. The algorithms are presented in the columns, where no means no feature extraction.

The first significant conclusion from the research is that after extraction with any of the methods, the quality of classification with each of the three algorithms increased statistically significantly (\(p<0.05\)). In the feature extraction task, the best results were obtained with the gpca and ccpca methods, and the classification quality after applying gpca and ccpca was statistically comparable. The kpca and fa methods do not differ significantly from each other. For the rf and k-nn algorithms, the ica method gave better results than extraction with the kpca and fa methods.

5 Conclusions

The purpose of this work was to develop a feature extraction method based on updating the matrix of eigenvalues and eigenvectors. For this task, the stochastic gradient method was used, with a regression function as the objective function. The study was conducted on a balanced dataset describing the prognosis of children with multiple sclerosis. During the analysis, it was possible to create a model that gives promising results for such a task. Two experiments were carried out. The first involved estimating the parameters of the gpca model, i.e. the threshold of total explained variance giving the best quality of classification, and the assignment of variables to the principal components. In Experiment 2, the classification quality of the svm, rf and k-nn algorithms was tested for various feature extraction methods. The obtained results showed that the best extraction methods are gpca and ccpca. The stochastic gradient method used in the task of minimizing the error in estimating the matrix of eigenvectors proved to be a good approach. The estimation of the gpca components was also carried out for each decision class separately. In this way, although the same sets of features were obtained for each class in each component, the attributes of the training set were matched differently, which in turn contributed to improving the quality of classification. The gpca algorithm proved comparable to the ccpca method, which was based on a normalized Varimax rotation with respect to the decision class centroids. The elaborated method was, as already mentioned, tested on real data concerning ms in children; however, it can also be used for other learning sets. In further research, the developed method will be tested on other learning sets, which will confirm its ability to handle various types of data. The biggest problem that can be encountered when using the stochastic gradient approach is the choice of the algorithm's step length.