
1 Introduction

Support vector machines (SVMs) are supervised classifiers that have been successfully deployed to solve a variety of pattern recognition and computer vision tasks. SVM training consists in determining a hyperplane that separates the training data belonging to two classes. The position of this hyperplane is defined by a (usually small) subset of all the vectors from the training set (\(\varvec{T}\))—the selected vectors are termed support vectors (SVs). Although the decision hyperplane separates the data linearly, the input vectors can be mapped into higher-dimensional spaces in which they become linearly separable—this mapping is achieved with kernel functions. The most frequently used kernel, which we also consider in this paper, is the radial basis function (RBF): \(\mathcal {K} \left( \varvec{u}, \varvec{v} \right) = \exp \left( - \gamma {\left\| \varvec{u} - \varvec{v} \right\| ^2}\right) \), where \(\varvec{u}\) and \(\varvec{v}\) are the input vectors and \(\gamma \) is the kernel width.
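
For concreteness, the RBF kernel value for a pair of input vectors can be computed directly, as in the minimal sketch below (NumPy is assumed; the snippet only illustrates the formula and is not part of any SVM solver).

```python
import numpy as np

def rbf_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2), the RBF kernel considered in this paper."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

# A larger kernel width gamma makes the kernel more local (values decay faster with distance).
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))   # exp(-1)  ~ 0.368
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=10.0))  # exp(-20) ~ 2.1e-9
```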

Selecting the SVM model (\(\mathcal {M}\)), namely the kernel function along with its parameters and the slack penalty coefficient (C), remains one of the main difficulties faced when applying SVMs in practice. Failing to properly tune these hyperparameters leads to poor SVM performance, and it is hardly possible to estimate their optimal values a priori. This problem has been extensively studied and a number of solutions have been proposed, including improvements to the standard grid search [19], trial-and-error approaches [6] and a variety of automated methods, which often involve evolutionary computation [3].

Another problem, which becomes increasingly important in the era of big data, concerns the high \(O(t^3)\) time and \(O(t^2)\) memory complexity of training SVMs, where \(t\) is the cardinality of \(\varvec{T}\). Furthermore, the number of obtained SVs (\(s\)) depends in practice on \(t\), and the classification time depends linearly on \(s\). Overall, there are two principal problems here: (i) some data sets are too large to train an SVM at all, and (ii) large training sets may result in slow classification, even if the SVM can be trained. Apart from enhancing the SVM training itself [12], this obstacle can be mitigated by selecting a small subset of the training set (\(\varvec{T'}\)), which contains the potential SVs capable of determining a proper decision hyperplane. This makes SVM training feasible for very large sets, and usually reduces the resulting \(s\) without affecting the classification score. Selecting a “good” training set is not trivial, though, and this problem has attracted considerable research attention, including our earlier successful attempts to exploit genetic [11] and memetic [16] algorithms.

1.1 Contribution

Tuning the SVM model is tricky when the training set is to be reduced, because the two tasks are mutually dependent—a model is usually required to select \(\varvec{T'}\), while a different SVM model may be optimal depending on the subset used for training. In most of the existing approaches, the model is selected either for the entire \(\varvec{T}\) (when the goal is to reduce \(s\) rather than to make SVM training feasible at all), or for a randomly selected \(\varvec{T'}\), prior to its proper refinement. In this paper, we propose a new approach that solves these two optimization problems in an alternating manner, using a genetic algorithm (GA). We focus on the RBF kernel, hence the SVM model is \(\mathcal {M}=(\gamma , C)\) in this case, and we exploit our earlier GASVM algorithm [10] for training set selection. The key aspects of our contribution are as follows: (i) we make it possible to run a single optimization process to select both \(\mathcal {M}\) and \(\varvec{T'}\), (ii) we alternately evolve two different populations to solve two optimization problems having a common fitness function, and (iii) we establish a new scheme that can easily embrace other techniques for selecting \(\varvec{T'}\) and may be extended beyond the RBF kernel.

1.2 Paper Structure

In Sect. 2, we outline the state-of-the-art on selecting \(\mathcal {M}\) and training SVMs from large sets. Our approach is described in Sect. 3 and the experimental results are reported in Sect. 4. The paper is concluded in Sect. 5.

2 Related Literature

To the best of our knowledge, there are no methods for selecting \(\mathcal {M}\) and \(\varvec{T'}\) simultaneously; however, there are many approaches that solve these problems independently of each other. They are briefly outlined in this section.

Model selection for SVMs is a computationally expensive task, especially if trial-and-error approaches are utilized [6]. It may consist in selecting the parameters of predefined kernels [23], but the desired kernel itself can also be determined—in [5], this is achieved using an evolution strategy. GAs are also applied for this purpose, as in [25] for the smooth twin parametric-margin SVMs. In another recent algorithm [3], the SVM parameters are optimized using a fast messy GA.

Other interesting approaches include tabu searches [13], genetic programming [22], and compression-based techniques [14], in which the coding precision is related to the geometrical characteristics of the data. A dynamic model adaptation strategy, which combines swarm intelligence with a standard grid search, was proposed in [9]. A promising research direction is to construct new kernels tailored for the problem at hand, including the use of neuro-fuzzy systems [17, 21].

The algorithms that deal with large training sets can be divided into those which: (i) enhance the SVM training itself [4, 8, 12], and (ii) decrease the size of the training set by retrieving the most valuable vectors. Importantly, the approaches from the first group still suffer from the high memory complexity of the training, which cannot be avoided in big data problems.

Decreasing the size of \(\varvec{T}\) makes SVM training feasible for large data sets, but it also allows for reducing \(s\), which accelerates the classification. The methods that exploit information on the layout of \(\varvec{T}\) encompass clustering-based algorithms [20, 24] and those utilizing the geometry of \(\varvec{T}\) without grouping the data [1]. Significant research effort has been put into algorithms exploiting the statistical properties of the \(\varvec{T}\) vectors [7]. Other techniques include various random sampling algorithms [18] as well as induction trees [2].

In our recent research, we pioneered the use of evolutionary algorithms for this task. In our initial approach (GASVM) [10], a population of individuals (chromosomes) representing refined sets of a fixed size (\(t'\)) evolves over time using standard genetic operators—selection, crossover, and mutation. GASVM was later enhanced to dynamically adapt its crucial parameters, including \(t'\), during the evolution [11]. We also exploited the knowledge concerning \(\varvec{T}\), attained during the evolution or extracted beforehand, in our memetic algorithms [15, 16].

3 Proposed Method

In the work reported here, we introduce ALGA—an ALternating Genetic Algorithm for selecting the SVM model and refining the training set. ALGA alternates between two main phases—one is aimed at optimizing \(\varvec{T'}\), while the other optimizes \(\mathcal {M}\), as illustrated in Fig. 1. The model selection phase is inspired by [3], while for selecting the training set, we exploit our relatively simple GASVM algorithm [10] to verify the very foundations of the new alternating scheme. The pseudocode of ALGA is given in Algorithm 1. In the first generation (\(G_0\)), two populations representing (i) refined training sets \(\varvec{T'}\) \(\left( \{p_i\}\right) \) and (ii) SVM models \(\left( \{q_i\}\right) \) are initialized (lines 1–2). For a randomly chosen SVM model (\(q_\mathrm{init}\)), we evaluate every individual \(p_i\) to select the best one (\(p^B\)) (lines 4–7). Afterwards, the algorithm enters the model optimization phase (lines 9–15), in which \(\mathcal {M}\) is optimized for the currently best refined training set \(p^B\). If the average fitness of all individuals in the population does not grow in two subsequent generations, then the local stop condition is met (line 15) and the algorithm enters the training set optimization phase (lines 17–23)—\(\varvec{T'}\) is optimized using the currently best hyperparameters (\(q^B\)), again until the same local stop condition is met (line 23). This alternating process is repeated as long as at least one of the two phases manages to improve the average fitness. Otherwise, the global stop condition (line 25) is reached and the SVM trained with the best pair of individuals \(\left( p^B;q^B\right) \) is retrieved.

Fig. 1. Flowchart of the proposed method.

Algorithm 1. The pseudocode of ALGA.
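
As the full listing is not reproduced here, the following is a deliberately simplified, runnable sketch of the alternating control flow on synthetic data. The per-phase search is reduced to resampling around the incumbent solution instead of the full GA operators of Sects. 3.2 and 3.3, the stop conditions are simplified, and all data, constants and helper names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_blobs(n):
    """Two Gaussian classes standing in for a real benchmark set."""
    X = np.vstack([rng.normal(-1.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))])
    y = np.hstack([np.zeros(n), np.ones(n)])
    return X, y

X_T, y_T = make_blobs(400)   # training set T
X_V, y_V = make_blobs(150)   # validation set V
K = 10                       # vectors per class in the refined set T'

def fitness(p, q):
    """Train an SVM on the subset of T indexed by p with model q = (gamma, C); fitness = accuracy on V."""
    clf = SVC(kernel="rbf", gamma=q[0], C=q[1]).fit(X_T[p], y_T[p])
    return clf.score(X_V, y_V)

def random_subset():
    """A candidate T': K randomly chosen indices from each class of T."""
    return np.concatenate([rng.choice(np.where(y_T == c)[0], K, replace=False) for c in (0, 1)])

def perturb_subset(p):
    """Replace a few vectors of T' with other vectors of the same class (a crude stand-in for mutation)."""
    p = p.copy()
    for s in rng.choice(len(p), max(1, len(p) // 5), replace=False):
        p[s] = rng.choice(np.where(y_T == y_T[p[s]])[0])
    return p

def perturb_model(q):
    """Jitter (gamma, C) multiplicatively."""
    return q[0] * rng.uniform(0.5, 2.0), q[1] * rng.uniform(0.5, 2.0)

def run_phase(p_best, q_best, phase, candidates=10):
    """Improve one component while the other stays frozen; stop once no candidate improves the fitness."""
    improved, best_fit = False, fitness(p_best, q_best)
    while True:
        if phase == "model":
            cand = [perturb_model(q_best) for _ in range(candidates)]
            fits = [fitness(p_best, q) for q in cand]
        else:
            cand = [perturb_subset(p_best) for _ in range(candidates)]
            fits = [fitness(p, q_best) for p in cand]
        i = int(np.argmax(fits))
        if fits[i] <= best_fit:                       # simplified local stop condition
            return p_best, q_best, improved
        best_fit, improved = fits[i], True
        if phase == "model":
            q_best = cand[i]
        else:
            p_best = cand[i]

p_best, q_best = random_subset(), (1.0, 1.0)
while True:                                           # global stop: neither phase improved
    p_best, q_best, improved_m = run_phase(p_best, q_best, "model")
    p_best, q_best, improved_t = run_phase(p_best, q_best, "training set")
    if not (improved_m or improved_t):
        break

final_svm = SVC(kernel="rbf", gamma=q_best[0], C=q_best[1]).fit(X_T[p_best], y_T[p_best])
print("validation accuracy:", final_svm.score(X_V, y_V), "| SVs:", int(final_svm.n_support_.sum()))
```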

3.1 Individuals and Their Fitness

The process of computing the fitness is shown in Fig. 2. A chromosome \(p_i\) defines a single \(\varvec{T'}\subset \varvec{T}\) containing K vectors from each class, hence the length of the chromosome is 2K. The SVM model is represented by \(q_j\)—as we limit the search to the RBF kernel, this chromosome contains two elements, \((\gamma _j, C_j)\). Two individuals, one from each population, are required to train the SVM, which is subsequently used to classify the validation set \(\varvec{V}\). Based on the ground-truth labels of \(\varvec{V}\), the classification accuracy is evaluated and used as the fitness \(\eta (p_i;q_j)\). Importantly, the test set \(\varPsi \), used for the final evaluation, is not seen during the optimization.

Fig. 2. The process of computing the fitness.

3.2 Selecting Training Set

Each individual \(p\) in the initial population (of size \(N\)) is created by randomly selecting K vectors from each class of \(\varvec{T}\). For parent selection, we exploit the high-low fit scheme. The population is sorted by the fitness, evaluated as outlined in Fig. 2, and divided into two equally sized parts. The parent \(p_{a}\) is selected from the more fitted part, while the parent \(p_{b}\) is drawn from the less fitted part of the population. The offspring solutions are appended to the population, forming a new population of size \(2N\). The \(N\) individuals with the highest fitness survive to maintain a constant population size.

Crossover of two individuals \(p_a\) and \(p_b\) is performed by creating the union of the two training sets defined by these individuals, from which 2K unique samples are selected randomly to form the offspring \(p_{a+b}\). Then, \(p_{a+b}\) undergoes mutation with the probability \(\mathcal {P}_m^{\varvec{T'}}\): \(\lfloor 2K\cdot f_m\rfloor \) randomly chosen vectors are substituted with other vectors from \(\varvec{T}\) (it is ensured that \(\varvec{T'}\) contains unique elements).
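
As a concrete illustration of the operators described in this subsection, the sketch below runs one generation of the training-set population on index-based chromosomes. The class labels and the fitness function are placeholders (in ALGA the fitness is the SVM accuracy on \(\varvec{V}\), see Fig. 2), so the whole block should be read as a sketch under those assumptions rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder stand-ins: class labels of T and a dummy fitness (in ALGA the fitness is the
# accuracy of an SVM, trained on the selected subset, over the validation set V).
y_T = np.hstack([np.zeros(500, dtype=int), np.ones(500, dtype=int)])
def fitness(p):
    return rng.random()                       # dummy value; only the operator mechanics matter here

K, N, P_m, f_m = 10, 20, 0.3, 0.2             # vectors per class, population size, P_m^T', f_m

def random_individual():
    """A chromosome: K vector indices from each class of T (length 2K)."""
    return np.concatenate([rng.choice(np.where(y_T == c)[0], K, replace=False) for c in (0, 1)])

def crossover(pa, pb):
    """Union of the two subsets, from which 2K unique indices are drawn at random."""
    pool = np.unique(np.concatenate([pa, pb]))
    return rng.choice(pool, 2 * K, replace=False)

def mutate(p):
    """Replace floor(2K * f_m) randomly chosen vectors with other (unique) vectors from T."""
    p = p.copy()
    n_swap = int(np.floor(2 * K * f_m))
    slots = rng.choice(len(p), n_swap, replace=False)
    outside = np.setdiff1d(np.arange(len(y_T)), p)
    p[slots] = rng.choice(outside, n_swap, replace=False)
    return p

def one_generation(pop):
    """High-low fit selection, crossover, mutation, then the N fittest of 2N individuals survive."""
    pop = sorted(pop, key=fitness, reverse=True)
    high, low = pop[:N // 2], pop[N // 2:]
    offspring = []
    for _ in range(N):
        pa = high[rng.integers(len(high))]    # parent from the more fitted half
        pb = low[rng.integers(len(low))]      # parent from the less fitted half
        child = crossover(pa, pb)
        if rng.random() < P_m:
            child = mutate(child)
        offspring.append(child)
    merged = pop + offspring                  # population of size 2N
    return sorted(merged, key=fitness, reverse=True)[:N]

population = [random_individual() for _ in range(N)]
population = one_generation(population)
print(len(population), population[0])
```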

3.3 SVM Model Optimization

The process of optimizing the hyperparameters is similar to the training set optimization, with three main differences concerning (i) initialization, (ii) crossover and (iii) mutation. The initial population is created deterministically to cover a large range of values with a logarithmic step (\(\gamma \in \{0.01, 0.1, 1, 10, 100\}\) and \(C \in \{0.1, 1, 10, 100\}\), hence \(M=20\)). Two individuals \(q_a=(\gamma _a,C_a)\) and \(q_b=(\gamma _b, C_b)\) are crossed over (with the probability \(\mathcal {P}_c^{\mathcal {M}}\)) to form a new individual \(q_{a+b}=(\gamma _{a+b},C_{a+b})\). Each child parameter value \(x_{a+b}\) (where x is either \(\gamma \) or C) becomes \(x_{a+b}=x_{a}+\alpha _{\mathcal {M}}\cdot (x_{a}-x_{b})\), where \(\alpha _{\mathcal {M}}\) is a crossover weight randomly drawn from the interval \(\left[ -0.5,1.5\right] \) to diversify the search (it is ensured that \(x_{a+b}>0\)). Mutation is performed with the probability \(\mathcal {P}_m^{\mathcal {M}}\), and it consists in replacing a value x with a new value drawn from the range \(\left[ x-\mathcal {\delta }_m\cdot x,\,x+\mathcal {\delta }_m\cdot x \right] \), where x is \(\gamma \) or C.
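
A compact sketch of these two operators is given below (NumPy assumed). The positivity clamp and the random seed are illustrative choices, since the paper only requires \(x_{a+b}>0\) without stating how this is enforced.

```python
import numpy as np

rng = np.random.default_rng(42)

def crossover_model(qa, qb):
    """Blend-style crossover: x_child = x_a + alpha * (x_a - x_b), alpha drawn from [-0.5, 1.5]."""
    child = []
    for xa, xb in zip(qa, qb):                 # qa and qb are (gamma, C) pairs
        alpha = rng.uniform(-0.5, 1.5)
        x = xa + alpha * (xa - xb)
        child.append(max(x, 1e-6))             # one simple way to keep gamma and C strictly positive
    return tuple(child)

def mutate_model(q, delta_m=0.1):
    """Replace each value x with a new one drawn from [x - delta_m*x, x + delta_m*x]."""
    return tuple(x + rng.uniform(-delta_m, delta_m) * x for x in q)

# Deterministic initial grid: gamma in {0.01, 0.1, 1, 10, 100}, C in {0.1, 1, 10, 100} (M = 20).
init_models = [(g, c) for g in (0.01, 0.1, 1.0, 10.0, 100.0) for c in (0.1, 1.0, 10.0, 100.0)]
print(crossover_model(init_models[0], init_models[-1]))
print(mutate_model((1.0, 10.0)))
```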

4 Experimental Validation

The algorithms were implemented in C++ (using LIBSVM) and run on a computer equipped with an Intel Xeon 3.2 GHz processor and 16 GB RAM. The parameters of the GA were set experimentally to \(N=20\), \(\mathcal {P}_m^{\varvec{T'}}=0.3\), \(f_m=0.2\), \(\mathcal {P}_m^{\mathcal {M}}=0.2\), \(\mathcal {\delta }_m=0.1\) and \(\mathcal {P}_c^{\mathcal {M}}=0.7\) (such values are commonly used in GAs, including our earlier works [10, 11]). We tested our algorithms using three sets of 2D points, for which we visualize the results: 2d-random-dots (\(1.7\cdot 10^4\) samples), 2d-random-points (1552 samples) and 2d-chessboard-dots (\(2.7\cdot 10^4\) samples). In the dots variants, the vectors of each class form clusters on the 2D plane, while in the points variant the vectors are isolated. Each set is divided into: a training set \(\varvec{T}\) (from which the \(\varvec{T'}\)’s are selected), a validation set \(\varvec{V}\) (on which the fitness is evaluated) and a test set \(\varPsi \). Furthermore, ALGA was validated on three benchmark sets from the UCI repository: German, Ionosphere and Wisconsin breast cancer, using 5-fold cross-validation. ALGA was run \(30{\times }\) for each 2D set and \(50{\times }\) for the benchmarks (\(10{\times }\) for every fold), and we verified the statistical significance of the differences using the two-tailed Wilcoxon test.
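
For quick reference, the parameter values listed above can be collected in a single configuration object; the snippet below is merely a compact restatement, and the key names are illustrative.

```python
# GA settings reported in this section (key names are illustrative, values as stated above).
ALGA_PARAMS = {
    "N": 20,            # size of the training-set population
    "P_m_Tprime": 0.3,  # mutation probability for T' individuals
    "f_m": 0.2,         # fraction of T' vectors replaced during mutation
    "P_m_M": 0.2,       # mutation probability for model individuals
    "delta_m": 0.1,     # relative mutation range for gamma and C
    "P_c_M": 0.7,       # crossover probability for model individuals
}
```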

Fig. 3. Pareto fronts (the accuracy for \(\varvec{V}\) vs. \(s\)) for (a) 2d-random-dots, (b) 2d-random-points and (c) 2d-chessboard-dots (for GASVM and ALGA, K is given in parentheses).

We compared ALGA against the SVM trained with the whole \(\varvec{T}\), whose model is optimized using grid search (GS) with a logarithmic step (we start with a step of 10, subsequently decreased to 2 around the best range), as well as against our GA working only in the \(\mathcal {M}\) optimization phase (termed GA-model), likewise trained with the whole \(\varvec{T}\). Furthermore, we report the scores obtained using GASVM with the model selected with GS performed on \(\varvec{T}\). ALGA and GASVM were tested for different values of K (we start with K equal to the data dimensionality and then increase it logarithmically with a step of 4, until 2K reaches \(t/2\)).
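
The coarse-to-fine grid-search baseline can be sketched as follows; the grid bounds, the toy data, and the direct scoring on the validation set are assumptions made for illustration, not the exact procedure used in the experiments.

```python
import numpy as np
from sklearn.svm import SVC

def grid_search(X_T, y_T, X_V, y_V):
    """Coarse logarithmic grid (step 10) followed by a finer grid (step 2) around the best cell."""
    def best_on(grid):
        scored = [(SVC(kernel="rbf", gamma=g, C=c).fit(X_T, y_T).score(X_V, y_V), g, c)
                  for g in grid["gamma"] for c in grid["C"]]
        return max(scored)                                      # highest validation accuracy wins
    coarse = {"gamma": np.logspace(-3, 3, 7), "C": np.logspace(-2, 3, 6)}
    _, g0, c0 = best_on(coarse)
    fine = {"gamma": g0 * np.float_power(2.0, np.arange(-3, 4)),
            "C":     c0 * np.float_power(2.0, np.arange(-3, 4))}
    return best_on(fine)

# Toy data for a quick check (two Gaussian classes).
rng = np.random.default_rng(1)
X_T = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y_T = np.hstack([np.zeros(100), np.ones(100)])
X_V = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y_V = np.hstack([np.zeros(50), np.ones(50)])
print(grid_search(X_T, y_T, X_V, y_V))   # (accuracy, gamma, C)
```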

Table 1. Results for the 2D data sets—accuracy and evolution time (in seconds).

In Fig. 3, we report the results obtained for the 2D sets—the Pareto fronts present the accuracy obtained for \(\varvec{V}\) using different K’s (i.e., the fitness) vs. the number of SVs (the smaller, the better). It can be seen that ALGA renders better or comparable scores to GASVM, taking into account both criteria (the differences are statistically significant at p = .05, except for 2d-random-points for \(K=8\) and 2d-chessboard-dots for \(K\in \{2,8\}\)). GA-model and GS usually allow for high accuracy at the cost of a very high \(s\), which in fact leads to overfitting (this is the reason for the low accuracy of GS for 2d-random-dots). This problem can also be seen in Fig. 4, where we visualize the results for selected K’s and compare them with GS and GA-model. Black and white points indicate the vectors from \(\varvec{V}\), and those marked with white and black crosses (the colors are swapped for better visualization) show the data selected to \(\varvec{T'}\) (yellow crosses indicate the SVs). It can be seen that when the model is selected with GS or GA-model, there are many SVs and the kernel width is small (the same happens for GASVM, as it relies on the model obtained with GS). The model selected with ALGA better reflects the underlying data structure—this is because the limited size of \(\varvec{T'}\) (\(|\varvec{T'}|=2K\)) calls for a model with which the SVM generalizes well enough to classify \(\varvec{V}\) (this is most evident for 2d-random-dots, where GASVM fails to generalize).
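
To make the notion of a Pareto front over the two criteria concrete, the short sketch below extracts the non-dominated configurations from a list of (validation accuracy, number of SVs) pairs; the labels and values are purely illustrative toy numbers, not the results reported in Fig. 3.

```python
def pareto_front(points):
    """Keep configurations that are not dominated: higher accuracy and fewer SVs are both preferred."""
    front = []
    for acc, n_sv, label in points:
        dominated = any(a >= acc and s <= n_sv and (a > acc or s < n_sv)
                        for a, s, _ in points)
        if not dominated:
            front.append((acc, n_sv, label))
    return sorted(front, key=lambda item: item[1])

# Toy values only (not taken from the paper):
runs = [(0.95, 120, "A"), (0.94, 35, "B"), (0.91, 20, "C"), (0.90, 40, "D")]
print(pareto_front(runs))   # D is dominated by B; A, B and C remain on the front
```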

Fig. 4. Examples of the results retrieved using various methods for the 2D data sets. (Color figure online)

The scores obtained for the 2D test sets are reported in Table 1. The tendencies are the same as those observed for the \(\varvec{V}\) sets. Although the evolution times for ALGA are longer than for GASVM, the latter requires the model to be selected beforehand—here, we used GS, which is very time consuming, especially for 2d-chessboard-dots, the largest set used in our experimental study.

For the UCI sets (the scores are shown in Table 2), GA-model is overfitted to \(\varvec{V}\), resulting in perfect classification with large values of \(s\); however, the accuracy for \(\varPsi \) remains similar to that obtained using other methods. ALGA and GASVM render similar accuracies (though the differences are statistically significant)—ALGA behaves slightly better for \(\varvec{V}\), while GASVM delivers slightly higher accuracy for \(\varPsi \). Although GASVM is better here in terms of classification performance, ALGA remains competitive, without the need for selecting \(\mathcal {M}\) prior to the training set selection.

Table 2. Results obtained for the UCI data sets.

5 Conclusions and Outlook

In this paper, we introduced ALGA—a new approach that selects both the SVM model and the training set within a single optimization process alternating between two phases. We incorporated relatively simple GAs to solve these two optimization problems and compared our method with the same GAs applied in a sequential manner. This allowed us to verify the very foundations of the new scheme, and our extensive experiments showed that ALGA is capable of selecting the training set without the necessity of tuning the SVM hyperparameters beforehand (which is non-trivial for sets that are too large to train an SVM from them).

The main shortcoming of the new method is inherited from GASVM, which we use to refine the training set—a proper value of K must be selected prior to starting the optimization. Importantly, we addressed this problem recently [11] by increasing the chromosome length during the evolution, and we developed a memetic algorithm [16] which outperforms the state-of-the-art methods, including those based on data structure analysis [24]. Therefore, our ongoing work is aimed at enhancing ALGA with these adaptive and memetic algorithms, which we expect to increase its competitiveness substantially. Furthermore, we aim at improving the model selection phase as well—instead of tuning the parameters of predefined kernel functions, we plan to allow for selecting them or constructing them from scratch. This will be an important step towards parameter-less evolutionary SVMs, easily applicable to large data sets.