Abstract
In this paper, we study asymptotic properties of nonlinear support vector machines (SVM) in high-dimension, low-sample-size settings. We propose a bias-corrected SVM (BC-SVM), which is robust against imbalanced data in a general framework. In particular, we investigate asymptotic properties of the BC-SVM with the Gaussian kernel and compare them with those of the BC-SVM with the linear kernel. We show that the performance of the BC-SVM is influenced by the scale parameter involved in the Gaussian kernel. We discuss a choice of the scale parameter that yields high performance and examine the validity of the choice by numerical simulations and real data analyses.
Acknowledgements
We are very grateful to the associate editor and the reviewer for their constructive comments. The research of the second author was partially supported by Grant-in-Aid for Scientific Research (C), Japan Society for the Promotion of Science (JSPS), under Contract Number 18K03409. The research of the third author was partially supported by Grants-in-Aid for Scientific Research (A) and Challenging Research (Exploratory), JSPS, under Contract Numbers 15H01678 and 17K19956.
Appendices
Appendix A: Soft-margin SVM
In Sects. 2–5, we discussed asymptotic properties and the performance of the hard-margin SVM (hmSVM). In this section, we consider the soft-margin SVM (smSVM). The smSVM is given by \({\hat{y}}({{{\varvec{x}}}})\) after replacing (4) with
where \(C(>0)\) is a regularization parameter. Let \(n_{\min }=\min \{n_1, n_2\}\). From (11) in Sect. 2, we can asymptotically claim that \({\hat{\alpha }}_j \le 2/(\varDelta _* n_{\min })\) for all j. Thus, we consider the following condition for C (a sketch of the soft-margin dual is given after Proposition 7):
Let \({\hat{y}}_{(S)}({{{\varvec{x}}}}_0)\) and \({\hat{y}}_{BC(S)}({{{\varvec{x}}}}_0)\) denote \({\hat{y}}({{{\varvec{x}}}}_0 )\) and \({\hat{y}}_{BC}({{{\varvec{x}}}}_0)\) after replacing (4) with (36), respectively. Then, we have the following result.
Proposition 7
Assume (A-i), (A-i’) and (8). Under (37), it holds that when \({{{\varvec{x}}}}_0\in \varPi _i\) for \(i=1,2\)
From Proposition 7, the bias-corrected smSVM (BC-smSVM) holds the consistency (6) even when \(|\delta /\varDelta |\rightarrow \infty \). Hence, for smSVMs, we recommend using the BC-smSVM.
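Since the optimization problem is standard, a sketch of the soft-margin dual may help orient the reader; the box constraint \(0\le \alpha _j\le C\) is the only change from the hard-margin problem (4), and condition (37) presumably requires \(C\) to be at least of the order \(2/(\varDelta _* n_{\min })\), in line with the bound on the \({\hat{\alpha }}_j\) noted above.

```latex
% A sketch of the standard soft-margin dual in the notation of (4);
% the upper bound C on each alpha_j is the only change from the
% hard-margin case.
\[
  \max_{\boldsymbol{\alpha}} \; L(\boldsymbol{\alpha})
  \quad \text{subject to} \quad
  \sum_{j=1}^{N} \alpha_j y_j = 0, \qquad
  0 \le \alpha_j \le C \quad (j = 1, \ldots, N).
\]
```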
For the settings (a) to (c) in Sect. 2.4, we examined the performance of the BC-smSVM and smSVM together with the hmSVM and bias-corrected hmSVM (BC-hmSVM) for the kernel function (II). We set \((n_1,n_2)=(20,10)\), \(d=1024\ (=2^{10})\) and \(\gamma =d/4\). We set \(C= 2^{-5+t}/(n_{\min }\varDelta _*), \ t=1,\ldots ,10\), for the smSVMs. Similar to Fig. 3, we calculated \({\overline{e}}\) from 2000 replications and plotted the results in Fig. 10. We observed that the smSVMs perform poorly when \(C<2/(n_{\min }\varDelta _*)\). As expected, the smSVMs are close to the hmSVMs when \(C>2/(n_{\min }\varDelta _*)\).
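For readers who wish to reproduce the flavor of this experiment, the following is a minimal sketch assuming scikit-learn is available; a simple Gaussian toy model with hypothetical means stands in for the settings (a) to (c), and the population distance \(\varDelta \) is used as a rough stand-in for \(\varDelta _*\).

```python
# A minimal sketch (not the paper's experiment): smSVM with the Gaussian
# kernel k(x, y) = exp(-||x - y||^2 / gamma_scale), gamma_scale = d/4, on
# imbalanced two-class Gaussian data, sweeping C = 2^{-5+t}/(n_min * Delta_*).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
d, n1, n2 = 1024, 20, 10
mu1, mu2 = np.zeros(d), np.full(d, 0.15)            # hypothetical class means

X = np.vstack([rng.normal(mu1, 1.0, (n1, d)),
               rng.normal(mu2, 1.0, (n2, d))])
y = np.array([1] * n1 + [-1] * n2)

gamma_scale = d / 4.0                               # the paper's gamma = d/4
delta_star = np.sum((mu1 - mu2) ** 2)               # Delta as stand-in for Delta_*
n_min = min(n1, n2)

# Balanced test set, so 1 - score approximates the average error rate.
X0 = np.vstack([rng.normal(mu1, 1.0, (200, d)),
                rng.normal(mu2, 1.0, (200, d))])
y0 = np.array([1] * 200 + [-1] * 200)

for t in range(1, 11):
    C = 2.0 ** (-5 + t) / (n_min * delta_star)
    # sklearn's RBF kernel is exp(-gamma * ||x - y||^2), so gamma = 1/gamma_scale.
    clf = SVC(C=C, kernel="rbf", gamma=1.0 / gamma_scale).fit(X, y)
    print(t, C, 1.0 - clf.score(X0, y0))
```

The error rate should deteriorate for the smallest values of t, where C falls below \(2/(n_{\min }\varDelta _*)\), and stabilize near the hard-margin behavior above it.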
Appendix B: Polynomial kernel SVM
In this section, we consider the polynomial kernel SVM; that is, the classifier (5) has the kernel function (III). We give some asymptotic properties of the polynomial kernel SVM. We consider the following conditions for \(\zeta \) and r:
We set \(\kappa _1=(\zeta + \Vert {{\varvec{\mu }}}_1\Vert ^2)^r\), \(\kappa _2=(\zeta +{\mathrm{tr}}({{\varvec{\varSigma }}}_1)+\Vert {{\varvec{\mu }}}_1\Vert ^2)^r\), \(\kappa _3=(\zeta +\Vert {{\varvec{\mu }}}_2\Vert ^2)^r\), \(\kappa _4=(\zeta +{\mathrm{tr}}({{\varvec{\varSigma }}}_2)+\Vert {{\varvec{\mu }}}_2\Vert ^2)^r\) and \(\kappa _5=(\zeta +{{\varvec{\mu }}}_1^T{{\varvec{\mu }}}_2)^r\). Then, we have the following result.
Proposition 8
Assume (1), (38) and (A-ii). Assume that N is fixed and
Then, the assumptions (A-i) and (A-i’) are met for the polynomial kernel (III). Furthermore, the BC-SVM (17) with the polynomial kernel (III) holds the consistency (6).
See Fig. 5 for the performance of the BC-SVM with the polynomial kernel (III).
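As an informal numerical illustration, the concentration of the polynomial kernel values around \(\kappa _1,\ldots ,\kappa _5\) that drives Proposition 8 can be checked directly; \(\zeta \), r and the means below are hypothetical.

```python
# Illustrative check of the concentration behind Proposition 8 for the
# polynomial kernel k(x, y) = (zeta + x^T y)^r: kernel values deviate from
# the corresponding kappa's by o_P(d^r).  All parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
d, r, zeta = 4096, 2, 1.0
mu1, mu2 = np.full(d, 0.2), np.zeros(d)

x11, x12 = rng.normal(mu1, 1.0, (2, d))   # two samples from Pi_1 (Sigma_1 = I_d)
x21 = rng.normal(mu2, 1.0, d)             # one sample from Pi_2 (Sigma_2 = I_d)

def k(u, v):
    return (zeta + u @ v) ** r

kappa1 = (zeta + mu1 @ mu1) ** r          # within Pi_1, j != j'
kappa2 = (zeta + d + mu1 @ mu1) ** r      # within Pi_1, j = j' (tr(Sigma_1) = d)
kappa5 = (zeta + mu1 @ mu2) ** r          # between classes

# Deviations normalized by d^r; each should be small.
print(abs(k(x11, x12) - kappa1) / d ** r)
print(abs(k(x11, x11) - kappa2) / d ** r)
print(abs(k(x11, x21) - kappa5) / d ** r)
```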
Remark 5
For the Laplace kernel (IV), it is difficult to provide asymptotic properties of the kernel SVM unless the \(\varPi _i\)s are Gaussian. A detailed study of the BC-SVM with the Laplace kernel is left for future work.
Appendix C: Proofs
1.1 Proof of Lemma 1
Note that \(L({\varvec{\alpha }})=\sum _{j=1}^N\alpha _j-\acute{{\varvec{\alpha }}}^\mathrm{T}{\varvec{K}}\acute{{\varvec{\alpha }}}/2\). The result is obtained from (7) straightforwardly. \(\square \)
1.2 Proofs of Propositions 1 and 2
We assume (A-i) and (A-i’). From Lemma 1, it holds that under (8) and (D)
so that \(L({\hat{{\varvec{\alpha }}}})=2{\hat{\alpha }}_{\star }-\varDelta _* {\hat{\alpha }}_{\star }^2\{1+o_P(\varDelta /\varDelta _*) \}/2\). Then, it holds that
Also, from (40) we have (9) under (8).
Next, we consider the second result of Proposition 1. Let \({\hat{S}}_1=\{j|{\hat{\alpha }}_j\ne 0,\ j=1,\ldots ,n_1\}\), \({\hat{S}}_2=\{j|{\hat{\alpha }}_j\ne 0,\ j=n_1+1,\ldots ,N\}\), \({\hat{n}}_{1}=\# {\hat{S}}_1\) and \({\hat{n}}_{2}=\# {\hat{S}}_2\). Then, we have that when \({{{\varvec{x}}}}_0 \in \varPi _i\) for \(i=1,2\),
Here, we note that \(\eta _1 \sum _{j=1}^{n_1} {\hat{\alpha }}_j^2/{\hat{\alpha }}_{\star }^2 \ge \eta _1/ {\hat{n}}_1\). Thus, from (40) it holds that
under (8). Similarly, we have \( \eta _2(1-{\hat{n}}_2/n_2)=o_P({\hat{n}}_2 \varDelta )\) under (8). Then, from (41) and (42), we have that when \({{{\varvec{x}}}}_0 \in \varPi _i\) for \(i=1,2\),
under (8). Hence, we conclude the second result of Proposition 1.
Finally, we consider the proof of Proposition 2. In view of (13), we obtain the first result. Noting that \(\varDelta _{*}/\varDelta \rightarrow 1\) and \(\delta /\varDelta =o(1)\) under (12) and (D), it holds from (42) that \({\hat{y}}({{{\varvec{x}}}}_0)=(-1)^i+o_P(1)\) under (12) and (D). This concludes the second result. \(\square \)
1.3 Proofs of Theorem 1 and Corollary 1
We assume (A-i) and (A-i’). We consider the following conditions under (D):
Let \(\varDelta _{*2}=\varDelta +\eta _{2}/n_{2}\). Note that \(\eta _1 \sum _{j=1}^{n_1}{\hat{\alpha }}_j^2/{\hat{\alpha }}_{\star }^2=o_P(\varDelta )\) under (45). Similar to (41), it holds from (42) and (43) that \({\hat{\alpha }}_{\star }=(2/{\varDelta _{*2}})\{1+o_P(\varDelta /\varDelta _{*2}) \}\) and
under (45) when \({{{\varvec{x}}}}_0 \in \varPi _i\) for \(i=1,2\). Note that \(\varDelta _{*}/\varDelta \rightarrow 1\) and \(\delta /\varDelta _{*} \rightarrow 0\) under (12) and \(\varDelta _{*}/ \varDelta _{*2}\rightarrow 1\) under (45). From Propositions 1, 2 and (46), we obtain (44) under (D) without (8). Thus, from (44), we conclude the results of Theorem 1 and Corollary 1. \(\square \)
1.4 Proofs of Lemma 2 and Theorem 2
Under (A-i) and (D), it holds that \({\hat{\varDelta }}_*=\varDelta _*+o_P(\varDelta )\) and \({\hat{\eta }}_i=\eta _i+o_P(\varDelta )\) for \(i=1,2\). Thus, we can conclude the result of Lemma 2. From the proofs of Theorem 1 and Corollary 1, we obtain (44) under (A-i) and (D). By combining (44) with Lemma 2, we conclude the result of Theorem 2. \(\square \)
1.5 Proofs of Lemma 3, Corollaries 2 and 3
We assume (A-ii) and (C-ii). Assume also \({{\varvec{\mu }}}_2={\varvec{0}}\) without loss of generality. Note that \(\kappa _1=\Vert {{\varvec{\mu }}}_1 \Vert ^2\), \(\kappa _2=\Vert {{\varvec{\mu }}}_1 \Vert ^2+{\mathrm{tr}}({{\varvec{\varSigma }}}_1)\), \(\kappa _3=\kappa _5=0\), \(\kappa _4=\eta _{2(I)}\) and \(\varDelta _{(I)}=\Vert {{\varvec{\mu }}}_1\Vert ^2\). Also, note that
Then, by using Chebyshev’s inequality, for any \(\tau >0\) we have that
from the fact that \(\sum _{r=1}^{p_1} ({\varvec{\gamma }}_{r}^\mathrm{T} {{\varvec{\mu }}}_1)^4\le ({{\varvec{\mu }}}_1^\mathrm{T}{{\varvec{\varSigma }}}_i{{\varvec{\mu }}}_1)^2\), where \({\varvec{\varGamma }}_1=[{\varvec{\gamma }}_1,\ldots ,{\varvec{\gamma }}_{p_1}]\). On the other hand, we have that
Note that \({{{\varvec{x}}}}_{1j}^\mathrm{T}{{{\varvec{x}}}}_{1j'}-\kappa _1=({{{\varvec{x}}}}_{1j}-{{\varvec{\mu }}}_1 )^\mathrm{T} ({{{\varvec{x}}}}_{1j'}-{{\varvec{\mu }}}_1)+{{\varvec{\mu }}}_1^\mathrm{T}({{{\varvec{x}}}}_{1j}-{{\varvec{\mu }}}_1 +{{{\varvec{x}}}}_{1j'}-{{\varvec{\mu }}}_1 )\). Thus, from (48) and (49), it holds that
Note that
from the fact that \({\mathrm{tr}}({{\varvec{\varSigma }}}_1{{\varvec{\varSigma }}}_2{{\varvec{\varSigma }}}_1{{\varvec{\varSigma }}}_2) \le \{{\mathrm{tr}}({{\varvec{\varSigma }}}_1{{\varvec{\varSigma }}}_2)\}^2\). Then, similar to (50), we have that
In addition, for any \(\tau >0\) we have that
for \(i=1,2\). Thus, from (48) and (52), it holds that for all \(j=1,\ldots , n_i;\ i=1,2 \)
This concludes the proof of Lemma 3.
The results of Corollaries 2 and 3 follow directly from Theorems 1 and 2 and Corollary 1. \(\square \)
1.6 Proofs of Lemma 4, Corollaries 4 and 5
We assume (A-ii). Let \(\varOmega = \min \{\gamma \varDelta _{(II)}/\psi ,\ \gamma \}\). Similar to (48), for any \(\tau >0\), we have that under (C-iii) and (D)
for \(i=1,2\), so that \(({{\varvec{\mu }}}_1-{{\varvec{\mu }}}_2)^\mathrm{T}({{{\varvec{x}}}}_{ij}-{{\varvec{\mu }}}_i)=o_P(\varOmega )\) for all \(j=1,\ldots ,n_i;\ i=1,2\). Similarly, \(\Vert {{{\varvec{x}}}}_{ij}-{{\varvec{\mu }}}_i\Vert ^2={\mathrm{tr}}({{\varvec{\varSigma }}}_i)+ o_P(\varOmega )\) for all \(j=1,\ldots ,n_i;\ i=1,2\), and \( ({{{\varvec{x}}}}_{1j}-{{\varvec{\mu }}}_1)^\mathrm{T}({{{\varvec{x}}}}_{2j'}-{{\varvec{\mu }}}_2)=o_P(\varOmega )\) for all \(j=1,\ldots ,n_1;\ j'=1,\ldots ,n_2\). Then, under (C-iii), we have that for all \(j=1,\ldots ,n_1;\ j'=1,\ldots ,n_2\)
from the fact that \(\kappa _{5(II)} \le \psi \). Similar to (53), we can conclude that the assumptions (A-i) and (A-i’) are met. This concludes the proof of Lemma 4.
The results of Corollaries 4 and 5 follow directly from Theorems 1 and 2 and Corollary 1. \(\square \)
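As an informal numerical illustration (not part of the proofs), the distance concentration exploited above for the Gaussian kernel can be checked directly: within-class squared distances cluster around \(2{\mathrm{tr}}({{\varvec{\varSigma }}}_i)\) and between-class ones around \({\mathrm{tr}}({{\varvec{\varSigma }}}_1)+{\mathrm{tr}}({{\varvec{\varSigma }}}_2)+\varDelta _{(I)}\), so the kernel values concentrate when \(\gamma \) is of the order of \({\mathrm{tr}}({{\varvec{\varSigma }}}_i)\). The means, covariances and \(\gamma \) below are hypothetical.

```python
# Illustrative check of the distance concentration behind Lemma 4 for the
# Gaussian kernel k(x, y) = exp(-||x - y||^2 / gamma).  Sigma_1 = Sigma_2 = I_d.
import numpy as np

rng = np.random.default_rng(2)
d = 4096
mu1, mu2 = np.full(d, 0.1), np.zeros(d)

x11, x12 = rng.normal(mu1, 1.0, (2, d))   # two samples from Pi_1
x21 = rng.normal(mu2, 1.0, d)             # one sample from Pi_2

delta_I = np.sum((mu1 - mu2) ** 2)        # Delta_(I) = ||mu_1 - mu_2||^2
gamma = 2.0 * d                           # gamma of the order of tr(Sigma_i)

print(np.sum((x11 - x12) ** 2) / (2 * d))            # within-class: approx 1
print(np.sum((x11 - x21) ** 2) / (2 * d + delta_I))  # between-class: approx 1
print(np.exp(-np.sum((x11 - x21) ** 2) / gamma),     # kernel value vs. its
      np.exp(-(2 * d + delta_I) / gamma))            # deterministic limit
```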
1.7 Proofs of Propositions 3 and 4
From (23), (C-iii) holds under (C-ii) and (C-v). Thus, from (44) and Lemmas 2 to 4, we conclude Proposition 4. For the proof of Proposition 3, we note that \({\mathrm{tr}}({{\varvec{\varSigma }}}_i)/\gamma \rightarrow 0\) for \(i=1,2,\) under (C-iv) from the fact that \(\varDelta _{(I)}=O(d)\). Thus, it holds that \(\psi \rightarrow 1\) and \(\gamma \eta _{i(II)}=2{\mathrm{tr}}({{\varvec{\varSigma }}}_i)+O(d^2/\gamma )\) for \(i=1,2,\) under (C-iv). In addition, from (22) it holds that \(\delta _{(II)}/\varDelta _{(II)}=\delta _{(I)}\{1+o(1)\}/\varDelta _{(I)}+o(1)\) under (C-iv). Thus, from (44) and Lemmas 3 and 4, we conclude Proposition 3. \(\square \)
1.8 Proof of Proposition 5
We assume (24). Note that \(1/\omega \rightarrow 0\) under (24). First, we consider the case when \(\limsup _{d\rightarrow \infty } \gamma _{\star }<\infty \). Then, it holds that \(F (\gamma _{\star }) = \{1+o(1)\}/\gamma _{ \star }\), so that \(\liminf _{d\rightarrow \infty } F (\gamma _{\star })>0\). Next, we consider the case when \(\gamma _{\star } \rightarrow \infty \). Let \(\nu =\omega /\gamma _{\star }\ (>0)\). Note that \(\nu =\varDelta _{(I)}/\gamma \). Then, it holds that
Let \(g(\nu )=\nu + 2\nu \exp (-\nu )/\{1-\exp (-\nu ) \}\). Note that \(g(\nu )\) is a monotonically increasing function and \(g(\nu )\rightarrow 2\) as \(\nu \rightarrow 0\), so that \( F (\gamma _{\star })=2\{1+o(1)\}/\omega =o(1)\) when \(\nu \rightarrow 0\). We can conclude the result. \(\square \)
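As a quick illustrative check of the two facts about \(g(\nu )\) used above, namely monotonicity and the limit \(g(\nu )\rightarrow 2\) as \(\nu \rightarrow 0\):

```python
# Numerical check (illustrative) of the properties of
# g(nu) = nu + 2*nu*exp(-nu)/(1 - exp(-nu)) used in the proof of Proposition 5.
import numpy as np

nu = np.linspace(1e-6, 10.0, 100_000)
g = nu + 2.0 * nu * np.exp(-nu) / (1.0 - np.exp(-nu))

print(np.all(np.diff(g) > 0))   # monotonically increasing over the grid: True
print(g[0])                     # approaches the limit 2 as nu -> 0
```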
1.9 Proof of Proposition 6
When \( \omega \le 1\), it holds that \( F (\gamma _{\star })=2\{1+o(1)\}/\omega \) under \(\gamma _\star \rightarrow \infty \). When \(\omega \le 1\) and \(\gamma _\star =1\), it holds that
from the facts that \(\exp (\omega +1)>1+(\omega +1)+(\omega +1)^2/2\ge 2+3\omega \), where the second inequality holds since the difference is \((\omega -1)^2/2\ge 0\), and that \(\exp (\omega -1) \ge \omega \). Hence, when \(\omega \le 1\), we have that \(\varDelta _{\varSigma }/\gamma _0 \in (0, \infty )\) as \(d\rightarrow \infty \). This concludes the result. \(\square \)
1.10 Proof of Proposition 7
From Proposition 1, Lemma 2 and (11), we can conclude the results. \(\square \)
1.11 Proof of Proposition 8
Recall that \(\kappa _1=(\zeta + \Vert {{\varvec{\mu }}}_1\Vert ^2)^r\), \(\kappa _2=(\zeta +{\mathrm{tr}}({{\varvec{\varSigma }}}_1)+\Vert {{\varvec{\mu }}}_1\Vert ^2)^r\), \(\kappa _3=(\zeta +\Vert {{\varvec{\mu }}}_2\Vert ^2)^r\), \(\kappa _4=(\zeta +{\mathrm{tr}}({{\varvec{\varSigma }}}_2)+\Vert {{\varvec{\mu }}}_2\Vert ^2)^r\) and \(\kappa _5=(\zeta +{{\varvec{\mu }}}_1^T{{\varvec{\mu }}}_2)^r\). From (1), we note that \({{\varvec{\mu }}}_i^T{{\varvec{\varSigma }}}_{i'}{{\varvec{\mu }}}_i\le \Vert {{\varvec{\mu }}}_i \Vert ^2 \lambda _{\max }({{\varvec{\varSigma }}}_{i'})=o(d^2)\) as \(d\rightarrow \infty \) for \(i,i'=1,2\). Then, similar to (50)–(52), for the polynomial kernel, we have that \({{{\varvec{x}}}}_{ij}^\mathrm{T}{{{\varvec{x}}}}_{ij'}=\Vert {{\varvec{\mu }}}_i\Vert ^2+o_P(d)\) for all \(j<j',\ i=1,2\), \({{{\varvec{x}}}}_{ij}^\mathrm{T}{{{\varvec{x}}}}_{ij}={\mathrm{tr}}({{\varvec{\varSigma }}}_i)+\Vert {{\varvec{\mu }}}_i\Vert ^2+o_P(d)\) for all i, j, and \({{{\varvec{x}}}}_{1j}^\mathrm{T}{{{\varvec{x}}}}_{2j'}={{\varvec{\mu }}}_1^T{{\varvec{\mu }}}_2+o_P(d)\) for all \(j,j'\), so that \(k({{{\varvec{x}}}}_{ij},{{{\varvec{x}}}}_{ij'})=\kappa _{2i-1}+o_P(d^r)\) for all \(j<j',\ i=1,2\), \(k({{{\varvec{x}}}}_{ij},{{{\varvec{x}}}}_{ij})=\kappa _{2i}+o_P(d^r)\) for all i, j, and \(k({{{\varvec{x}}}}_{1j},{{{\varvec{x}}}}_{2j'})=\kappa _{5}+o_P(d^r)\) for all \(j,j'\). Here, note that
from the fact that \((\zeta +{{\varvec{\mu }}}_1^T{{\varvec{\mu }}}_2)^r\le (\zeta + \Vert {{\varvec{\mu }}}_1\Vert ^2)^{r/2}(\zeta + \Vert {{\varvec{\mu }}}_2\Vert ^2)^{r/2} \), which follows from the Cauchy–Schwarz inequality applied to the augmented vectors \((\sqrt{\zeta },{{\varvec{\mu }}}_1^T)^T\) and \((\sqrt{\zeta },{{\varvec{\mu }}}_2^T)^T\). Then, it holds that \(\liminf _{d\rightarrow \infty }\varDelta /d^r>0\) from (39). Thus, we have (A-i). Similarly, we can conclude (A-i’). From Theorem 2, the BC-SVM (17) holds (6) for the polynomial kernel. This concludes the proof of Proposition 8. \(\square \)