1 Introduction

Models of statistical learning and classification methods are vital components in many current applications, such as autonomous driving [13], natural language processing [2] and game ai [23]. A challenging aspect of machine learning concerns the balancing of classification accuracy and generalizability on unseen data, especially if only few training examples are available.

For classification problems, supervised learning has the aim to derive a decision function \(y = h({\varvec{x}})\) from a labeled training set \( Tr = ({\varvec{x}}_{i}, y_{i})_{i=1}^{N}\), where \({\varvec{x}} \in \mathbb {R}^{F}\) are feature vectors from an F-dimensional feature space and \(y \in C\) are labels chosen from a finite set of class labels. Different theoretical models and learning algorithms have been proposed in the past. Support-Vector-Machines (svm), originally developed by Cortes and Vapnik [5], have retained widespread usage due to their excellent theoretical underpinnings and their competitive performance on many datasets. Other vector machine approaches were later proposed to alleviate some of the shortcomings of svms. These include, for example, extensions for multi-class problems [1, 26], probabilistic decision functions [25, 26] and highly sparse solutions [25].

Herbrich et al. [7] presented their own take on a vector-machine classifier based on the concept of a Bayesian point estimate of the optimal parametrized decision plane. This Bayesian-Point-Machine (bpm) ties maximum-margin classification into a larger framework of Bayesian decision making. As a nontrivial byproduct, learning a bpm constructs an approximation of the Bayesian posterior over all classification models. This posterior distribution can, for example, be used to inexpensively derive various statistics for use in more complex decision models or to compute calibrated class membership probabilities. In this paper, we propose three improvements to the bpm classifier. Firstly, the bpm is based on a regularized hard-margin model. Although the bpm has been proven to have good generalization capabilities for the hard-margin case, this was never conclusively shown for the soft-margin variant. Our experiments in Sect. 5 show that this may not be the case. The regularization also introduces an additional hyperparameter into the model, which must be carefully tuned. To solve these problems, we substitute the statistical data model with a true soft-margin model that contains no nuisance parameters. Secondly, we extend this new formulation to handle multi-class problems natively. Thirdly, since these changes necessitate a new sampling approach, we introduce a novel sampling algorithm that can create a sample from our posterior with a runtime complexity of \(O(N^2 |C| + N|C|^2)\). Our statistical classifier will subsequently be called the Multi-class-Soft-margin-Bayes-Point-Machine (ms-bpm).

Our paper is composed as follows. Section 2 provides a brief introduction to bpms. Sections 3 and 4 then introduce our new soft-margin model and a fast multi-class sampling algorithm. We evaluate the generalization capabilities and class membership probabilities of the ms-bpm in Sect. 5 and conclude with Sect. 6.

2 Bayes-Point-Machines

The bpm utilizes a very simple statistical model. In the hard-margin case, all classifiers that manage to perfectly separate a training set \( Tr \) receive a uniform likelihood, while classifiers that generate at least one training error are discarded. The set of valid classifiers can be described by a convex polytope called the version space. Using a Bayes estimator with an assumed \(L_{2}\)-loss, the point estimate of an optimal decision plane is simply the center of mass of this version space. This point is also called the Bayes-point classifier. It was shown that the bpm generalizes the concept of maximum-margin classification and will often generalize at least as well as the svm [7]. The soft-margin case, where some margin is sacrificed to mitigate the effects of outliers and overlapping class distributions, was handled for the kernelized version of the algorithm by regularizing the Gram matrix. In effect, this allows for some misclassified training examples near the decision boundary. This approach introduces a tunable dataset-dependent hyperparameter whose value must be optimized, e.g. using cross-validation.

Since the Bayes-point can be formulated as an expectation over the posterior distribution of classification models, sampling methods based on the Markov-Chain-Monte-Carlo (mcmc) methodology can be an effective way of estimation [21]. In the original works, a billiard scheme was proposed to generate a sample from the uniformly distributed version space [7, 22]. Later works improved the computational efficiency using the Expectation Propagation algorithm by approximating the posterior under the assumption of local Gaussianity [14].

In the next section, we present our new statistical model that directly models the soft-margin case without introducing an additional hyperparameter. Furthermore, our model can be straightforwardly extended to multi-class problems.

3 Statistical Model of Soft Margin Classification

Sampling from the distribution of decision boundaries requires the definition of a posterior distribution \(p(\varvec{\beta } | Tr )\). This section therefore introduces and elucidates the required components of our statistical multi-class soft-margin model. This includes a parametrization \(\varvec{\beta }\) of the decision boundaries, a data-dependent likelihood term \(l( Tr | \varvec{\beta })\) and a prior distribution \(p(\varvec{\beta })\) for the model parameters.

3.1 Parametrization

A non-probabilistic classifier can be parametrized using any partitioning function that subdivides the feature-space into |C| partitions. To simplify the sampling, we will focus on linear partitionings. Non-linear decision boundaries can then be modeled via non-linear projections of the feature-space, e.g. using the kernel trick [8]. Following the example of generalized linear models [10], each class c has an associated linear predictor \(f_{c}({\varvec{x}})\). Given a feature-vector \({\varvec{x}}\), we always choose the class which produces the maximum response:

$$\begin{aligned} h({\varvec{x}})&= \mathop {{{\mathrm{arg\,max}}}}\limits _{c \in C} ~ f_{c}({\varvec{x}}) = \mathop {{{\mathrm{arg\,max}}}}\limits _{c \in C} ~ \varvec{\beta }_{c}^{T}{\varvec{x}} + \beta _{c,0}. \end{aligned}$$
(1)

The parameters \(\varvec{\beta }_{c}\) are the importance weights of the linear predictor for class c and \(\beta _{c, 0}\) its intercept. We will further call a specific instantiation parametrized by the vector \(\varvec{\beta }\) a configuration. Furthermore, the parameters of this model can be reduced by subtracting \(\varvec{\beta }_{1}^{T}{\varvec{x}} + \beta _{1,0}\) from all predictor functions. In this formulation, the anchor class \(c=1\) will always produce a zero response, while the remaining functions model the relative predictions for each class compared to the anchor class.
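The decision rule in Eq. (1) and the anchor-class reduction can be illustrated with a minimal Python sketch. The function names `predict` and `anchor`, the list-based data layout and the zero-based class indices are assumptions made for this illustration only; they are not part of the original method description.

```python
def predict(x, betas, intercepts):
    """Return the index of the class with the largest linear response.

    betas[c] is the weight vector and intercepts[c] the intercept of class c.
    Index 0 plays the role of the anchor class (c = 1 in the text).
    """
    responses = [
        sum(w * xi for w, xi in zip(beta, x)) + b
        for beta, b in zip(betas, intercepts)
    ]
    return max(range(len(responses)), key=responses.__getitem__)


def anchor(betas, intercepts):
    """Subtract the predictor of the first class from all predictors.

    The anchor class then has an all-zero predictor, and the arg max is
    unchanged because the same function is subtracted from every response.
    """
    ref_beta, ref_b = betas[0], intercepts[0]
    new_betas = [[w - r for w, r in zip(beta, ref_beta)] for beta in betas]
    new_intercepts = [b - ref_b for b in intercepts]
    return new_betas, new_intercepts
```

Since the same predictor is subtracted from every class response, the predicted class of any feature vector is identical before and after the reduction.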

The remaining model parameters are still redundant in regard to uniform scaling. Herbrich et al. [7] solved this problem for the two-class case by reparameterizing the model using a hyperspherical coordinate transform and normalizing the radius to 1. We argue that the original Cartesian parametrization allows for a simpler mcmc sampling algorithm. We solve the redundancy in a more classical fashion by introducing appropriate priors on the model parameters.

3.2 Data Likelihood

The likelihood used by Herbrich et al. [7] is based on a simplified data model that is only valid for the hard-margin case. All configurations that achieve zero empirical training errors have a constant likelihood, while all configurations that produce at least one error are discarded. In case of outliers and overlapping class distributions, it may prove beneficial to admit at least some errors. In the original formulation, this is achieved by ignoring training errors that are geometrically close to the decision plane. For our soft-margin model, we would like to derive a likelihood that is more closely related to a well-defined data generating process. The likelihood of the entire training dataset \( Tr \) is usually defined by its log-loss:

$$\begin{aligned} l( Tr | \varvec{\beta })&= \exp ({{\mathrm{LogLoss}}}( Tr , \varvec{\beta })) = \prod _{i=1}^{N}{p(y_{i} | {\varvec{x}}_{i}, \varvec{\beta })}. \end{aligned}$$
(2)

Logistic regression [10], for example, substitutes the class label probabilities \(p(y_{i} | {\varvec{x}}_{i}, \varvec{\beta })\) with the logistic function \((1 + \exp ({\varvec{x}}_{i}^{T} \cdot \varvec{\beta }))^{-1}\). The bpm, on the other hand, assumes a 0–1 loss, i.e. it sets \(p(y_{i} | {\varvec{x}}_{i}, \varvec{\beta }) = 1_{y_{i} = h({\varvec{x}}_{i})}\). It can easily be seen that even a single misclassified example pulls the entire likelihood down to zero, which is highly problematic for the non-separable case. Intuitively, this can be interpreted as the bpm model placing infinite confidence in the decisions of the learned classifier. In order to handle overlapping class distributions, we propose to regularize the model by additionally estimating the classification confidences from the data. Our modified likelihood reads as

$$\begin{aligned} l( Tr | \varvec{\beta }, \varvec{\pi })&= \prod _{i=1}^{N}{\pi _{y_{i}, h({\varvec{x}}_{i})}}, \end{aligned}$$
(3)

where \(\pi _{c, p} \in (0, 1)\) is the probability that an example \({\varvec{x}}_{i}\) with a true class label \(c = y_{i}\) is classified as class \(p = h({\varvec{x}}_{i})\). Left as free parameters, these probabilities would require dataset-dependent tuning. We can improve the robustness of our model in regard to the parameters \(\varvec{\pi }\) by placing an appropriate prior distribution on them, thus creating a hierarchical model. In the Bayesian spirit, we then marginalize these parameters. Assuming Dirichlet priors with parameters \(\varvec{\alpha }\), this produces the likelihood

$$\begin{aligned} \nonumber l_{\text {dm}}( Tr | \varvec{\beta }, \varvec{\alpha })&= \int { l( Tr | \varvec{\beta }, \varvec{\pi }) \cdot p_{\text {Dir}}(\varvec{\pi } | \varvec{\alpha }) \, \mathop {}\!\mathrm {d}\varvec{\pi }} \\&\propto \prod _{c=1}^{|C|} \frac{\prod _{p=1}^{|C|}{\varGamma (M_{c,p} + \alpha _{c,p})}}{\varGamma \left( \sum _{p=1}^{|C|}{(M_{c,p} + \alpha _{c,p})} \right) } , \end{aligned}$$
(4)

where \(\varGamma (.)\) is the Gamma function and \(M_{c,p}\) are the counts of how many training examples from class c were assigned to partition p. This model is also called a Dirichlet-multinomial or multivariate Pólya distribution [15]. In our model, the confidence we place in a classifier is largely based on the number of training examples it was derived from. In the separable case, our regularized model will tend towards the bpm model for large N. Yet we still require a principled way of tuning the \(\alpha \) parameters. General pointers on parametrizing Dirichlet distributions can be gleaned from the statistical literature. Generally, we get an uninformative flat prior by setting \(\alpha _{c,p} = 1\). It turns out that this is not a sensible choice for classification models. As can be seen in Fig. 1, such a prior would place too much weight on models that exhibit high empirical errors. We need to guarantee that reductions in error always correspond to increases in likelihood. This property trivially holds for the weakly informative prior with \(\alpha _{c, p} = 1\) for \(c \ne p\) and \(\alpha _{c, c} = 1 + N\). Furthermore, we introduce the regularization parameter \(\nu \) by setting \(\alpha _{c, c} = 1 + \frac{N}{\nu }\). This way, setting \(\nu \rightarrow \infty \) produces the uninformative prior, while \(\nu \rightarrow 0\) strongly penalizes misclassifications and corresponds in the limit to the original bpm model. Sensible choices for \(\nu \) lie in the interval (0, 1], but our model is largely robust to the specific choice of \(\nu \). For all experiments, we simply fix it to \(\nu = 1\).
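The effect of the two priors on Eq. (4) can be checked numerically. The following Python sketch evaluates the unnormalized log-likelihood from a count matrix M. The helper names and the balanced two-class count layout are assumptions chosen for illustration and do not reproduce the exact setup behind Fig. 1.

```python
import math


def log_likelihood_dm(counts, alpha):
    """Unnormalized log of Eq. (4) for a |C| x |C| count matrix M."""
    total = 0.0
    for m_row, a_row in zip(counts, alpha):
        total += sum(math.lgamma(m + a) for m, a in zip(m_row, a_row))
        total -= math.lgamma(sum(m + a for m, a in zip(m_row, a_row)))
    return total


def weakly_informative_alpha(num_classes, n, nu=1.0):
    """alpha[c][p] = 1 off the diagonal and 1 + n / nu on the diagonal."""
    return [
        [1.0 + n / nu if c == p else 1.0 for p in range(num_classes)]
        for c in range(num_classes)
    ]


def two_class_counts(n_per_class, errors_per_class):
    """Hypothetical balanced two-class count matrix with symmetric errors."""
    ok = n_per_class - errors_per_class
    return [[ok, errors_per_class], [errors_per_class, ok]]
```

In this toy setting with N = 100, the weakly informative prior yields a log-likelihood that decreases strictly with the number of errors. The flat prior instead assigns the same value to a classifier that labels every example correctly and to one that labels every example incorrectly.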

Fig. 1. Comparison of the log-likelihoods for a small training set (\(N = 100\)) plotted over the number of misclassified examples. Models under comparison are the Dirichlet-multinomial model with an uninformative Dirichlet prior and a weakly informative Dirichlet prior.

3.3 Feature Weight Prior

The parametrization introduced in Sect. 3.1 is redundant in regard to uniform scaling; that is \(\varvec{\beta } \equiv t \cdot \varvec{\beta }\) for \(t > 0\). This has the consequence that, given a uniform prior over the weights, the resulting posterior distribution will be improper. The typical solution involves replacing the uniform priors with proper ones. We would expect a good prior distribution to be zero-centered, symmetrical, weakly-informative and simple to compute. In past works, the normal distribution and the Laplace distribution have been used frequently, especially since they have strong ties to \(L_2\) and \(L_1\) regularization, respectively.

$$\begin{aligned} p(\varvec{\beta })&\propto \exp \left( - \frac{1}{2 \sigma ^{2}} ||\varvec{\beta } ||_{2}^{2} \right)&\text {Normal Prior} \end{aligned}$$
(5)
$$\begin{aligned} p(\varvec{\beta })&\propto \exp \left( -\frac{1}{\sigma } ||\varvec{\beta } ||_{1} \right)&\text {Laplace Prior} \end{aligned}$$
(6)

We can reduce the informativeness of the prior by increasing the scale parameter \(\sigma \). The main difference between the two models is that the normal prior produces more dense solutions, while the Laplace prior prefers sparsity in the weight parameters. In more recent works, even more sparsity inducing prior distributions have been used [25]. The original bpm approach is restricted to dense models. Although, for computational reasons, our current implementation only uses dense normal priors, the ms-bpm method could be used in conjunction with any of these sparsity inducing priors.
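As a small illustration of Eqs. (5) and (6), the unnormalized log-densities of both priors can be written in a few lines of Python. The function names are assumptions for this sketch, and the normalization constants are deliberately omitted because they cancel in mcmc sampling.

```python
def log_normal_prior(beta, sigma):
    """Unnormalized log of the isotropic normal prior, Eq. (5)."""
    return -sum(b * b for b in beta) / (2.0 * sigma ** 2)


def log_laplace_prior(beta, sigma):
    """Unnormalized log of the Laplace prior, Eq. (6)."""
    return -sum(abs(b) for b in beta) / sigma
```

Both functions assign a lower value to \(t \cdot \varvec{\beta }\) than to \(\varvec{\beta }\) for \(t > 1\), which is what removes the scaling redundancy of the parametrization.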

4 Fast Multi-class MCMC Sampler

In this section, we will introduce an efficient sampling scheme for our proposed statistical model. Multivariate sampling is achieved by performing fast univariate sampling along randomized search directions. Quick convergence can then be reached by adapting the distribution of search directions to the local properties of the posterior distribution.

4.1 Univariate vs Multivariate Sampling

The optimization of such classification problems based on a 0–1 loss is known to be NP-hard [17]. Each training example splits the likelihood along \(|C| - 1\) half-spaces. As such, the potential number of equivalence classes of different solutions can be stated as \(2^{N \cdot (|C| - 1)}\). Direct sampling from the multivariate posterior distribution quickly becomes prohibitively expensive even for small training sets. As can be seen in Fig. 2, the posterior distribution tends to be highly discontinuous, which is a direct result of our choice of the 0–1 loss. The lack of useful gradient information also diminishes the effectiveness of a large class of mcmc algorithms, such as billiard schemes [16], Hamiltonian Monte Carlo [9] and covariance adaptive slice sampling [24]. One important observation is that any univariate sampling path can intersect at most \(N \cdot (|C| - 1)\) discontinuities. This implies that a univariate sampling algorithm could be implemented with a much lower computational complexity than a multivariate one. We describe such a sampling method in Sect. 4.2. To facilitate fast convergence of the Markov chain, it is essential to select useful search directions with a high probability. Our univariate sampler can be directly embedded in a number of higher-level sampling methods, such as Gibbs sampling [21], Hit-and-Run [4] and Adaptive Direction Sampling (ads) [6].

Fig. 2. A heatmap of the posterior distribution for a two-class toy problem with a 1D feature-space. Discontinuities are represented by gray stippled lines.

Fig. 3. Example of a three-class problem. The upper envelope of the linear predictor functions defines the partition membership intervals for one individual training example \({\varvec{x}}\) along the search path. In this particular case, class 2 is selected left of the discontinuity and class 3 otherwise. Class 1 is never selected.

4.2 Efficient Univariate Sampling Along Arbitrary Search Paths

Our fast univariate sampler will start at a configuration \(\varvec{\beta }_{t}\) and be given a search direction \({\varvec{d}}\). The set of possible configurations

$$\begin{aligned} \varvec{\beta }_{t+1}&= \varvec{\beta }_{t} + u \cdot {\varvec{d}} \end{aligned}$$
(7)

forms our search path for the next configuration. It is important to realize that \(\varvec{\beta }_{t}\) and \({\varvec{d}}\) are fixed. As such, u is the random variable that we are actually sampling from. The task can also be stated as a problem of sampling from the posterior distribution conditioned on the search path in Eq. (7).

Our approach to this problem can be broken down into the following four steps:

  1. Find all discontinuities along the search path.

  2. Construct a discrete distribution of all intervals spanned by two consecutive discontinuities.

  3. Draw an interval from this distribution.

  4. Finally, draw a new configuration from the selected interval.

The first step can be directly tackled by substituting (7) into (1) as follows:

$$ h({\varvec{x}}_{i}) = \mathop {{{\mathrm{arg\,max}}}}\limits _{c \in C} ~ (\varvec{\beta }_{t,c} + u \cdot {\varvec{d}}_{c})^{T}{\varvec{x}}_{i} + (\beta _{t,c,0} + u \cdot d_{c,0}) . $$

Each example \({\varvec{x}}_{i}\) in the training set will generate at most \(|C| - 1\) discontinuities. These are situated at values for u, where the predictor functions of two classes become equal. By computing the upper envelope of all |C| predictor functions, e.g. using the convex-hull trick [18], we can find the partition assignment intervals for all configurations on the search path. The discontinuities actually mark the transitions between equivalence classes of solutions. Figure 3 shows an example for a three-class problem.
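To make step 1 concrete: along the search path, each class predictor for a fixed example is a line in u, with offset \(\varvec{\beta }_{t,c}^{T}{\varvec{x}}_{i} + \beta _{t,c,0}\) and slope \({\varvec{d}}_{c}^{T}{\varvec{x}}_{i} + d_{c,0}\). The following Python sketch computes the upper envelope of these lines with a standard slope-sorted sweep. It is one possible realization of the convex-hull trick, and the function names and tuple layout are assumptions for illustration rather than the authors' implementation.

```python
def predictor_lines(x, betas, intercepts, d_betas, d_intercepts):
    """Return one (slope, offset, class) triple per class for example x."""
    lines = []
    for c, (b, b0, d, d0) in enumerate(
        zip(betas, intercepts, d_betas, d_intercepts)
    ):
        offset = sum(w * xi for w, xi in zip(b, x)) + b0
        slope = sum(w * xi for w, xi in zip(d, x)) + d0
        lines.append((slope, offset, c))
    return lines


def upper_envelope(lines):
    """Return (cuts, labels) of the upper envelope of the given lines.

    labels[k] is the winning class between cuts[k - 1] and cuts[k], with
    the first interval starting at -infinity and the last ending at
    +infinity, so len(labels) == len(cuts) + 1.
    """
    ordered = sorted(lines, key=lambda l: (l[0], l[1]))
    dedup = []
    for s, o, lab in ordered:
        if dedup and dedup[-1][0] == s:
            dedup[-1] = (s, o, lab)  # equal slope: the larger offset wins
        else:
            dedup.append((s, o, lab))
    hull, cuts = [], []
    for s, o, lab in dedup:
        while hull:
            s0, o0, _ = hull[-1]
            u = (o0 - o) / (s - s0)  # where the new line overtakes hull[-1]
            if cuts and u <= cuts[-1]:
                hull.pop()  # hull[-1] is never on top, drop it
                cuts.pop()
            else:
                break
        if hull:
            cuts.append(u)
        hull.append((s, o, lab))
    return cuts, [lab for _, _, lab in hull]
```

For the situation of Fig. 3, lines such as `(0.0, 0.0, 1)`, `(-1.0, 1.0, 2)` and `(1.0, 1.0, 3)` produce a single discontinuity with class 2 to its left and class 3 to its right, while class 1 never appears on the envelope.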

Next for step 2, we store the discontinuities for all training examples in a list and sort them in ascending order. Our aim is to visit all discontinuities in a successive order. This allows us to efficiently update the counts \(M_{c,p}\), which are required to evaluate the likelihood. We will start by initializing the counts for \(u=-\infty \). Each discontinuity along the search path marks a point, where a training example switches from being classified as \(p=p'\) to \(p=p''\). We update the counts accordingly:

$$\begin{aligned} M_{y_{i}, p'}&= M_{y_{i}, p'} - 1 \\ M_{y_{i}, p''}&= M_{y_{i}, p''} + 1. \end{aligned}$$
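The sweep over the sorted discontinuities can be sketched in Python as follows. The input format (one pair of cut positions and predicted labels per training example) and the function name `sweep_counts` are assumptions for illustration. For clarity, the sketch stores a full copy of the count matrix per interval, whereas an efficient implementation would update the likelihood incrementally.

```python
def sweep_counts(envelopes, labels, num_classes):
    """Track the count matrix M along the search path.

    envelopes[i] = (cuts, preds) describes example i: preds[k] is its
    predicted class on the k-th interval and cuts are the discontinuities.
    labels[i] is the true class y_i. Returns the sorted discontinuities and
    one count matrix per interval (len(positions) + 1 matrices).
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    events = []
    for (cuts, preds), y in zip(envelopes, labels):
        counts[y][preds[0]] += 1  # assignment at u = -infinity
        for k, u in enumerate(cuts):
            events.append((u, y, preds[k], preds[k + 1]))
    events.sort(key=lambda e: e[0])

    positions = []
    snapshots = [[row[:] for row in counts]]
    for u, y, p_old, p_new in events:
        counts[y][p_old] -= 1  # example leaves partition p'
        counts[y][p_new] += 1  # and enters partition p''
        positions.append(u)
        snapshots.append([row[:] for row in counts])
    return positions, snapshots
```

Each event changes exactly two entries of one row of M, which is why the likelihood can be maintained cheaply while visiting the discontinuities in ascending order.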

To compute the interval probability for the discretized sampling problem, we have to integrate over the conditional posterior density:

$$\begin{aligned} \nonumber p_{j}&= \int _{u_{j}}^{u_{j+1}}{l_{\text {dm}}(M | \varvec{\alpha }) \cdot p(\varvec{\beta }_{t} + u \cdot {\varvec{d}}) \, \mathop {}\!\mathrm {d}u}\\&= l_{\text {dm}}(M | \varvec{\alpha }) \cdot \int _{u_{j}}^{u_{j+1}}{p(\varvec{\beta }_{t} + u \cdot {\varvec{d}}) \, \mathop {}\!\mathrm {d}u}. \end{aligned}$$
(8)

Notice that the likelihood remains constant over the entire interval, since it only depends on the counts M. Integrating over the prior distribution is also trivial for the case of an isotropic normal distribution.
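For the isotropic normal prior of Eq. (5), the integral in Eq. (8) has a closed form. Expanding \(||\varvec{\beta }_{t} + u \cdot {\varvec{d}} ||_{2}^{2}\) as a quadratic in u shows that the prior restricted to the search path is, up to a constant factor shared by all intervals, a univariate normal density in u with mean \(-\varvec{\beta }_{t}^{T}{\varvec{d}} / ||{\varvec{d}} ||_{2}^{2}\) and standard deviation \(\sigma / ||{\varvec{d}} ||_{2}\). The Python sketch below uses this to compute the normalized interval probabilities. The function names are assumptions for illustration, and the sketch assumes that at least one interval has non-zero prior mass.

```python
import math


def line_prior_params(beta_t, d, sigma):
    """Mean and standard deviation of u under the prior along the path."""
    dd = sum(v * v for v in d)
    bd = sum(b * v for b, v in zip(beta_t, d))
    return -bd / dd, sigma / math.sqrt(dd)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interval_probabilities(cuts, log_liks, mu, s):
    """Normalized p_j from Eq. (8).

    cuts are the sorted discontinuities and log_liks holds one
    log-likelihood per interval, so len(log_liks) == len(cuts) + 1.
    """
    edges = [-math.inf] + list(cuts) + [math.inf]
    log_w = []
    for j, ll in enumerate(log_liks):
        mass = normal_cdf((edges[j + 1] - mu) / s) - normal_cdf((edges[j] - mu) / s)
        log_w.append(ll + math.log(mass) if mass > 0.0 else -math.inf)
    top = max(log_w)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_w]
    total = sum(weights)
    return [w / total for w in weights]
```

Working with log-likelihoods avoids underflow, since the Gamma-function terms of Eq. (4) grow quickly with N.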

After selecting an interval from the discretized interval distribution for step 3, all that remains is to draw a new configuration from the selected interval in step 4. Once again, we make use of the fact that the likelihood is constant. Therefore, the problem reduces to sampling from the prior distribution, conditioned on the selected interval. In our case, this means to draw a configuration \(\varvec{\beta }_{t+1}\) from an appropriately parametrized truncated normal distribution.
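Steps 3 and 4 can be sketched with the Python standard library. The snippet draws an interval index from the discrete distribution and then samples u from a normal distribution truncated to that interval by inverting the cumulative distribution function. The function names, the clamping constant and the inverse-CDF approach are assumptions of this sketch. Inversion is simple but loses accuracy far in the tails, where a dedicated truncated-normal sampler would be preferable.

```python
import math
import random
from statistics import NormalDist


def draw_interval(probabilities, rng):
    """Step 3: draw an interval index from the discrete distribution."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]


def sample_truncated_normal(mu, s, lo, hi, rng):
    """Step 4: draw u from N(mu, s^2) restricted to [lo, hi]."""
    dist = NormalDist(mu, s)
    p_lo = 0.0 if lo == -math.inf else dist.cdf(lo)
    p_hi = 1.0 if hi == math.inf else dist.cdf(hi)
    p = p_lo + rng.random() * (p_hi - p_lo)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # inv_cdf requires 0 < p < 1
    u = dist.inv_cdf(p)
    return min(max(u, lo), hi)  # guard against rounding at the borders


def step(beta_t, d, u):
    """New configuration beta_{t+1} = beta_t + u * d from Eq. (7)."""
    return [b + u * v for b, v in zip(beta_t, d)]
```

Here `mu` and `s` denote the mean and standard deviation of the prior restricted to the search path, and `rng` is a `random.Random` instance so that runs can be reproduced with a fixed seed.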

The runtime complexity of our sampling algorithm can be stated as \(O(N^2|C| + N|C|^2)\) for the kernelized version. For typical datasets, where \(N \gg |C|\), this is equivalent to the fast approximated bpm approach in [7] (\(O(N^2|C|)\)) and compares favorably with support-vector-machines (\(O(N^3 |C|)\)), relevance-vector-machines (\(O(N^3 |C|)\)) and import-vector-machines (\(O(N^2 q^2 |C|)\)), where we assumed a one-vs-rest scheme for the methods without native multi-class support. Of course, sampling-based methods will usually also incur a much higher constant factor compared to optimization-based learning algorithms, so this advantage may only play out for very large datasets.

4.3 Choosing Good Search Directions

Reliably choosing good search directions is of great importance. Two non-adaptive methods are Gibbs sampling [21] and Hit-and-Run sampling [4]. Gibbs sampling proposes to only use search directions that are parallel to the axes of the parameter space. The sampler alternates between these directions using either a predefined schedule or a random schedule. If two or more of the parameters are highly correlated, the Markov chain may be required to temporarily assume a low-probability state in order to reach more promising parts of the posterior distribution. This property may cause slow convergence. Hit-and-Run sampling, on the other hand, chooses a uniformly sampled random search direction at each iteration. It is more robust and can often show surprisingly fast convergence [12]. An adaptive sampling scheme, which exploits knowledge about the local correlation structure between parameters, is expected to significantly improve convergence in most cases. One simple adaptive sampling method is the ads scheme [6]. ads works by sampling multiple Markov chains in parallel. At each step, one chain is randomly chosen to be iterated on. In contrast to Hit-and-Run sampling, however, the search direction is not chosen uniformly. Information from two other randomly selected chains is utilized in order to steer the search along the principal directions of the parameter space. To prevent the sampler from getting stuck in a particular subspace of the parameter space, some precautions have to be taken. Following the findings of Gilks et al. [6], it has proven effective to occasionally use a search direction generated by a non-adaptive method. The sampling behavior is typically very robust in regard to this selection probability. In our implementation, we arbitrarily chose to select ads with \(85\%\) probability and Hit-and-Run sampling with \(15\%\) probability.
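The mixture of adaptive and non-adaptive directions can be sketched as follows. The text does not spell out how the two selected chains are combined, so using the difference of their current states is an assumption of this sketch, as are the function names. The uniform Hit-and-Run direction is obtained by normalizing a standard Gaussian vector.

```python
import math
import random


def hit_and_run_direction(dim, rng):
    """Uniformly distributed direction on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def choose_direction(chains, current, rng, p_adaptive=0.85):
    """Pick a search direction for the chain with index `current`.

    With probability p_adaptive the direction is the difference of the
    states of two other randomly selected chains (assumed ads variant),
    otherwise a Hit-and-Run direction is used as a fallback.
    """
    dim = len(chains[current])
    others = [k for k in range(len(chains)) if k != current]
    if len(others) >= 2 and rng.random() < p_adaptive:
        a, b = rng.sample(others, 2)
        d = [x - y for x, y in zip(chains[a], chains[b])]
        if any(v != 0.0 for v in d):
            return d
    return hit_and_run_direction(dim, rng)
```

The default `p_adaptive=0.85` mirrors the \(85\%\) / \(15\%\) split stated above. The fallback also covers the degenerate case in which the two selected chains coincide.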

5 Evaluation

In this section, we evaluate our proposed ms-bpm method. We use the original bpm model with soft-margin regularization and the svm as baseline methods. For all experiments, we simulated 200 independent mcmc chains of length 50, using 1000 iterations for the ads sampler. We set \(\nu = 1\) as described in Sect. 3.2. The hyperparameters for the svm and bpm classifiers were optimized using a grid-search approach. All methods use the same rbf kernel with the kernel parameter \(\gamma \) that was selected during the grid search for the svm runs. The kernel parametrizations for all methods were implemented exactly as in [3].

5.1 UCI Datasets

Our main evaluation is based on the commonly used supervised learning datasets from the uci database [11]. These seven datasets cover a range of different classification problems of varying size, feature space and number of classes. We show the validity of our method for small and large training sets by training on \(10\%\) and \(50\%\) bootstrap samples for each dataset. The values presented in Table 1 show the out-of-bag accuracies for 100 independent runs and their standard deviations. As can be seen, our ms-bpm method displays performance characteristics similar to those of the baseline svm classifier, yet it does not require hand-tuning of any regularization hyperparameters. The original bpm approach for soft-margin classification regularized the Gram matrix to allow for some empirical errors on the training set. Our experiments show that this approach is not competitive with our improved statistical model on most datasets, especially for small training sets. The large standard deviations also indicate some robustness problems that are not observable in our method. A Wilcoxon signed-rank test shows with a \(97.5\%\) confidence level that our ms-bpm classifier significantly improves on the bpm classifier. The same test is inconclusive when used to compare the results of the ms-bpm and svm classifiers.

Table 1. Out-of-bag accuracy estimates for the original regularized bpm, the svm and our ms-bpm classifier on uci datasets. Estimates were averaged over 100 bootstrap runs for simulated training sets of small (\(10\%\)) and large (\(50\%\)) size. For each experiment, the best result is printed in bold. Ties with the first place, as determined by a two-sample t-test with a \(97.5\%\) confidence interval, are also printed in bold.

5.2 Class Membership Probabilities

The class membership probabilities generated by our model often tend to better represent the true probabilities than those of classifiers that were calibrated after training, e.g. using Platt scaling [19]. This difference is amplified for small training sets, as any post-hoc calibration has to be based on a sub-sampling method, such as cross-validation. Figure 4 compares the membership probabilities for an svm model and our classifier on the Ripley synthetic dataset [20]. This dataset features a two-class problem in a two-dimensional feature space. Both classes are mixtures of two Gaussians with distinct modes. The difference in calibration can be measured by comparing the log loss (\(E[-\log (p(y_{i} | {\varvec{x}}_{i}))]\)) as estimated from a test sample. Our experiment gave the following results:

$$\begin{aligned} {{\mathrm{LogLoss}}}_{ SVM }&= 0.15 \\ {{\mathrm{LogLoss}}}_{ MS - BPM }&= 0.08 \end{aligned}$$

Thus, our ms-bpm model reduces the log loss of the svm by approximately \(47\%\) (\((0.15 - 0.08) / 0.15 \approx 0.47\)). Most of the gains come from the improved estimation of membership probabilities in the higher-density regions of the dataset.

Fig. 4. Posterior class membership probabilities for the Ripley synthetic dataset. Both classifiers use an rbf kernel with \(\gamma =7.5\) as selected via grid search. The heatmaps show the posterior plots of the learned classification models while the overlaid contour plots depict the true probabilities. The ms-bpm classifier (right) tends to generate more confident predictions than the svm classifier (left) and produces smoother class boundaries.

6 Conclusion

In this paper, we presented our proposed improvements to the bpm classifier. The experiments demonstrated that our ms-bpm model exhibits similar performance to svms and significantly improves on the original bpm, especially for small training sets. Yet it requires less hand-tuning of hyperparameters while also supporting multi-class problems natively. We also showed that the class membership probabilities generated by our model are superior to post-hoc calibrated probabilities for maximum-margin models. The algorithmic complexity of our learning algorithm (\(O(N^2 |C| + N |C|^2)\)) also compares favorably to other kernelized vector-machine classifiers.