Abstract
In this paper, we study asymptotic properties of nonlinear support vector machines (SVM) in high-dimension, low-sample-size settings. We propose a bias-corrected SVM (BC-SVM), which is robust against imbalanced data in a general framework. In particular, we investigate asymptotic properties of the BC-SVM with the Gaussian kernel and compare them with those of the BC-SVM with the linear kernel. We show that the performance of the BC-SVM is influenced by the scale parameter involved in the Gaussian kernel. We discuss a choice of the scale parameter that yields high performance and examine the validity of the choice by numerical simulations and real data analyses.
Acknowledgements
We are very grateful to the associate editor and the reviewer for their constructive comments. The research of the second author was partially supported by Grant-in-Aid for Scientific Research (C), Japan Society for the Promotion of Science (JSPS), under Contract Number 18K03409. The research of the third author was partially supported by Grants-in-Aid for Scientific Research (A) and Challenging Research (Exploratory), JSPS, under Contract Numbers 15H01678 and 17K19956.
Appendices
Appendix A: Soft-margin SVM
In Sects. 2–5, we discussed asymptotic properties and the performance of the hard-margin SVM (hmSVM). In this section, we consider the soft-margin SVM (smSVM). The smSVM is given by \({\hat{y}}({{{\varvec{x}}}})\) after replacing (4) with
where \(C(>0)\) is a regularization parameter. Let \(n_{\min }=\min \{n_1, n_2\}\). From (11) in Sect. 2, we can asymptotically claim that \({\hat{\alpha }}_j \le 2/(\varDelta _* n_{\min })\) for all j. Thus, we consider the following condition for C (a sketch of the soft-margin dual is given after Proposition 7):
Let \({\hat{y}}_{(S)}({{{\varvec{x}}}}_0)\) and \({\hat{y}}_{BC(S)}({{{\varvec{x}}}}_0)\) denote \({\hat{y}}({{{\varvec{x}}}}_0 )\) and \({\hat{y}}_{BC}({{{\varvec{x}}}}_0)\) after replacing (4) with (36), respectively. Then, we have the following result.
Proposition 7
Assume (A-i), (A-i’) and (8). Under (37), it holds that when \({{{\varvec{x}}}}_0\in \varPi _i\) for \(i=1,2\)
From Proposition 7, the bias-corrected smSVM (BC-smSVM) holds the consistency (6) even when \(|\delta /\varDelta |\rightarrow \infty \). Hence, for smSVMs, we recommend using the BC-smSVM.
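Since the optimization problem is standard, a sketch of the soft-margin dual may help orient the reader; the box constraint \(0\le \alpha _j\le C\) is the only change from the hard-margin problem (4), and condition (37) presumably requires \(C\) to be at least of the order \(2/(\varDelta _* n_{\min })\), in line with the bound on the \({\hat{\alpha }}_j\) noted above.

```latex
% A sketch of the standard soft-margin dual in the notation of (4);
% the upper bound C on each alpha_j is the only change from the
% hard-margin case.
\[
  \max_{\boldsymbol{\alpha}} \; L(\boldsymbol{\alpha})
  \quad \text{subject to} \quad
  \sum_{j=1}^{N} \alpha_j y_j = 0, \qquad
  0 \le \alpha_j \le C \quad (j = 1, \ldots, N).
\]
```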
For the settings (a) to (c) in Sect. 2.4, we examined the performance of the BC-smSVM and smSVM together with the hmSVM and bias-corrected hmSVM (BC-hmSVM) for the kernel function (II). We set \((n_1,n_2)=(20,10)\), \(d=1024\ (=2^{10})\) and \(\gamma =d/4\). We set \(C= 2^{-5+t}/(n_{\min }\varDelta _*), \ t=1,\ldots ,10\), for the smSVMs. Similar to Fig. 3, we calculated \({\overline{e}}\) from 2000 replications and plotted the results in Fig. 10. We observed that the smSVMs perform poorly when \(C<2/(n_{\min }\varDelta _*)\). As expected, the smSVMs are close to the hmSVMs when \(C>2/(n_{\min }\varDelta _*)\).
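For readers who wish to reproduce the flavor of this experiment, the following is a minimal sketch assuming scikit-learn is available; a simple Gaussian toy model with hypothetical means stands in for the settings (a) to (c), and the population distance \(\varDelta \) is used as a rough stand-in for \(\varDelta _*\).

```python
# A minimal sketch (not the paper's experiment): smSVM with the Gaussian
# kernel k(x, y) = exp(-||x - y||^2 / gamma_scale), gamma_scale = d/4, on
# imbalanced two-class Gaussian data, sweeping C = 2^{-5+t}/(n_min * Delta_*).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
d, n1, n2 = 1024, 20, 10
mu1, mu2 = np.zeros(d), np.full(d, 0.15)            # hypothetical class means

X = np.vstack([rng.normal(mu1, 1.0, (n1, d)),
               rng.normal(mu2, 1.0, (n2, d))])
y = np.array([1] * n1 + [-1] * n2)

gamma_scale = d / 4.0                               # the paper's gamma = d/4
delta_star = np.sum((mu1 - mu2) ** 2)               # Delta as stand-in for Delta_*
n_min = min(n1, n2)

# Balanced test set, so 1 - score approximates the average error rate.
X0 = np.vstack([rng.normal(mu1, 1.0, (200, d)),
                rng.normal(mu2, 1.0, (200, d))])
y0 = np.array([1] * 200 + [-1] * 200)

for t in range(1, 11):
    C = 2.0 ** (-5 + t) / (n_min * delta_star)
    # sklearn's RBF kernel is exp(-gamma * ||x - y||^2), so gamma = 1/gamma_scale.
    clf = SVC(C=C, kernel="rbf", gamma=1.0 / gamma_scale).fit(X, y)
    print(t, C, 1.0 - clf.score(X0, y0))
```

The error rate should deteriorate for the smallest values of t, where C falls below \(2/(n_{\min }\varDelta _*)\), and stabilize near the hard-margin behavior above it.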
Appendix B: Polynomial kernel SVM
In this section, we consider the polynomial kernel SVM; that is, the classifier (5) has the kernel function (III). We give some asymptotic properties of the polynomial kernel SVM. We consider the following conditions for \(\zeta \) and r:
We set \(\kappa _1=(\zeta + \Vert {{\varvec{\mu }}}_1\Vert ^2)^r\), \(\kappa _2=(\zeta +{\mathrm{tr}}({{\varvec{\varSigma }}}_1)+\Vert {{\varvec{\mu }}}_1\Vert ^2)^r\), \(\kappa _3=(\zeta +\Vert {{\varvec{\mu }}}_2\Vert ^2)^r\), \(\kappa _4=(\zeta +{\mathrm{tr}}({{\varvec{\varSigma }}}_2)+\Vert {{\varvec{\mu }}}_2\Vert ^2)^r\) and \(\kappa _5=(\zeta +{{\varvec{\mu }}}_1^T{{\varvec{\mu }}}_2)^r\). Then, we have the following result.
Proposition 8
Assume (1), (38) and (A-ii). Assume that N is fixed and
Then, the assumptions (A-i) and (A-i’) are met for the polynomial kernel (III). Furthermore, the BC-SVM (17) with the polynomial kernel (III) holds the consistency (6).
See Fig. 5 for the performance of the BC-SVM with the polynomial kernel (III).
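As an informal numerical illustration, the concentration of the polynomial kernel values around \(\kappa _1,\ldots ,\kappa _5\) that drives Proposition 8 can be checked directly; \(\zeta \), r and the means below are hypothetical.

```python
# Illustrative check of the concentration behind Proposition 8 for the
# polynomial kernel k(x, y) = (zeta + x^T y)^r: kernel values deviate from
# the corresponding kappa's by o_P(d^r).  All parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
d, r, zeta = 4096, 2, 1.0
mu1, mu2 = np.full(d, 0.2), np.zeros(d)

x11, x12 = rng.normal(mu1, 1.0, (2, d))   # two samples from Pi_1 (Sigma_1 = I_d)
x21 = rng.normal(mu2, 1.0, d)             # one sample from Pi_2 (Sigma_2 = I_d)

def k(u, v):
    return (zeta + u @ v) ** r

kappa1 = (zeta + mu1 @ mu1) ** r          # within Pi_1, j != j'
kappa2 = (zeta + d + mu1 @ mu1) ** r      # within Pi_1, j = j' (tr(Sigma_1) = d)
kappa5 = (zeta + mu1 @ mu2) ** r          # between classes

# Deviations normalized by d^r; each should be small.
print(abs(k(x11, x12) - kappa1) / d ** r)
print(abs(k(x11, x11) - kappa2) / d ** r)
print(abs(k(x11, x21) - kappa5) / d ** r)
```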
Remark 5
For the Laplace kernel (IV), it is difficult to provide asymptotic properties of the kernel SVM unless the \(\varPi _i\)s are Gaussian. A detailed study of the BC-SVM with the Laplace kernel is left for future work.
Appendix C: Proofs
1.1 Proof of Lemma 1
Note that \(L({\varvec{\alpha }})=\sum _{j=1}^N\alpha _j-\acute{{\varvec{\alpha }}}^\mathrm{T}{\varvec{K}}\acute{{\varvec{\alpha }}}/2\). The result is obtained from (7) straightforwardly. \(\square \)
1.2 Proofs of Propositions 1 and 2
We assume (A-i) and (A-i’). From Lemma 1, it holds that under (8) and (D)
so that \(L({\hat{{\varvec{\alpha }}}})=2{\hat{\alpha }}_{\star }-\varDelta _* {\hat{\alpha }}_{\star }^2\{1+o_P(\varDelta /\varDelta _*) \}/2\). Then, it holds that
Also, from (40) we have (9) under (8).
Next, we consider the second result of Proposition 1. Let \({\hat{S}}_1=\{j|{\hat{\alpha }}_j\ne 0,\ j=1,\ldots ,n_1\}\), \({\hat{S}}_2=\{j|{\hat{\alpha }}_j\ne 0,\ j=n_1+1,\ldots ,N\}\), \({\hat{n}}_{1}=\# {\hat{S}}_1\) and \({\hat{n}}_{2}=\# {\hat{S}}_2\). Then, we have that when \({{{\varvec{x}}}}_0 \in \varPi _i\) for \(i=1,2\),
Here, we note that \(\eta _1 \sum _{j=1}^{n_1} {\hat{\alpha }}_j^2/{\hat{\alpha }}_{\star }^2 \ge \eta _1/ {\hat{n}}_1\). Thus, from (40) it holds that
under (8). Similarly, we have \( \eta _2(1-{\hat{n}}_2/n_2)=o_P({\hat{n}}_2 \varDelta )\) under (8). Then, from (41) and (42), we have that when \({{{\varvec{x}}}}_0 \in \varPi _i\) for \(i=1,2\),
under (8). Hence, we conclude the second result of Proposition 1.
Finally, we consider the proof of Proposition 2. In view of (13), we obtain the first result. Noting that \(\varDelta _{*}/\varDelta \rightarrow 1\) and \(\delta /\varDelta =o(1)\) under (12) and (D), it holds from (42) that \({\hat{y}}({{{\varvec{x}}}}_0)=(-1)^i+o_P(1)\) under (12) and (D). This concludes the second result. \(\square \)
1.3 Proofs of Theorem 1 and Corollary 1
We assume (A-i) and (A-i’). We consider the following conditions under (D):
Let \(\varDelta _{*2}=\varDelta +\eta _{2}/n_{2}\). Note that \(\eta _1 \sum _{j=1}^{n_1}{\hat{\alpha }}_j^2/{\hat{\alpha }}_{\star }^2=o_P(\varDelta )\) under (45). Similar to (41), it holds from (42) and (43) that \({\hat{\alpha }}_{\star }=(2/{\varDelta _{*2}})\{1+o_P(\varDelta /\varDelta _{*2}) \}\) and
under (45) when \({{{\varvec{x}}}}_0 \in \varPi _i\) for \(i=1,2\). Note that \(\varDelta _{*}/\varDelta \rightarrow 1\) and \(\delta /\varDelta _{*} \rightarrow 0\) under (12) and \(\varDelta _{*}/ \varDelta _{*2}\rightarrow 1\) under (45). From Propositions 1, 2 and (46), we obtain (44) under (D) without (8). Thus, from (44), we conclude the results of Theorem 1 and Corollary 1. \(\square \)
1.4 Proofs of Lemma 2 and Theorem 2
Under (A-i) and (D), it holds that \({\hat{\varDelta }}_*=\varDelta _*+o_P(\varDelta )\) and \({\hat{\eta }}_i=\eta _i+o_P(\varDelta )\) for \(i=1,2\). Thus, we can conclude the result of Lemma 2. From the proofs of Theorem 1 and Corollary 1, we obtain (44) under (A-i) and (D). By combining (44) with Lemma 2, we conclude the result of Theorem 2. \(\square \)
1.5 Proofs of Lemma 3, Corollaries 2 and 3
We assume (A-ii) and (C-ii). Assume also \({{\varvec{\mu }}}_2={\varvec{0}}\) without loss of generality. Note that \(\kappa _1=\Vert {{\varvec{\mu }}}_1 \Vert ^2\), \(\kappa _2=\Vert {{\varvec{\mu }}}_1 \Vert ^2+{\mathrm{tr}}({{\varvec{\varSigma }}}_1)\), \(\kappa _3=\kappa _5=0\), \(\kappa _4=\eta _{2(I)}\) and \(\varDelta _{(I)}=\Vert {{\varvec{\mu }}}_1\Vert ^2\). Also, note that
Then, by using Chebyshev’s inequality, for any \(\tau >0\) we have that
from the fact that \(\sum _{r=1}^{p_1} ({\varvec{\gamma }}_{r}^\mathrm{T} {{\varvec{\mu }}}_1)^4\le ({{\varvec{\mu }}}_1^\mathrm{T}{{\varvec{\varSigma }}}_i{{\varvec{\mu }}}_1)^2\), where \({\varvec{\varGamma }}_1=[{\varvec{\gamma }}_1,\ldots ,{\varvec{\gamma }}_{p_1}]\). On the other hand, we have that
Note that \({{{\varvec{x}}}}_{1j}^\mathrm{T}{{{\varvec{x}}}}_{1j'}-\kappa _1=({{{\varvec{x}}}}_{1j}-{{\varvec{\mu }}}_1 )^\mathrm{T} ({{{\varvec{x}}}}_{1j'}-{{\varvec{\mu }}}_1)+{{\varvec{\mu }}}_1^\mathrm{T}({{{\varvec{x}}}}_{1j}-{{\varvec{\mu }}}_1 +{{{\varvec{x}}}}_{1j'}-{{\varvec{\mu }}}_1 )\). Thus, from (48) and (49), it holds that
Note that
from the fact that \({\mathrm{tr}}({{\varvec{\varSigma }}}_1{{\varvec{\varSigma }}}_2{{\varvec{\varSigma }}}_1{{\varvec{\varSigma }}}_2) \le \{{\mathrm{tr}}({{\varvec{\varSigma }}}_1{{\varvec{\varSigma }}}_2)\}^2\). Then, similar to (50), we have that
In addition, for any \(\tau >0\) we have that
for \(i=1,2\). Thus, from (48) and (52), it holds that for all \(j=1,\ldots , n_i;\ i=1,2 \)
This concludes the proof of Lemma 3.
The results of Corollaries 2 and 3 follow directly from Theorems 1 and 2 and Corollary 1. \(\square \)
1.6 Proofs of Lemma 4, Corollaries 4 and 5
We assume (A-ii). Let \(\varOmega = \min \{\gamma \varDelta _{(II)}/\psi ,\ \gamma \}\). Similar to (48), for any \(\tau >0\), we have that under (C-iii) and (D)
for \(i=1,2\), so that \(({{\varvec{\mu }}}_1-{{\varvec{\mu }}}_2)^\mathrm{T}({{{\varvec{x}}}}_{ij}-{{\varvec{\mu }}}_i)=o_P(\varOmega )\) for all \(j=1,\ldots ,n_i;\ i=1,2\). Similarly, \(\Vert {{{\varvec{x}}}}_{ij}-{{\varvec{\mu }}}_i\Vert ^2={\mathrm{tr}}({{\varvec{\varSigma }}}_i)+ o_P(\varOmega )\) for all \(j=1,\ldots ,n_i;\ i=1,2\), and \( ({{{\varvec{x}}}}_{1j}-{{\varvec{\mu }}}_1)^\mathrm{T}({{{\varvec{x}}}}_{2j'}-{{\varvec{\mu }}}_2)=o_P(\varOmega )\) for all \(j=1,\ldots ,n_1;\ j'=1,\ldots ,n_2\). Then, under (C-iii), we have that for all \(j=1,\ldots ,n_1;\ j'=1,\ldots ,n_2\)
from the fact that \(\kappa _{5(II)} \le \psi \). Similar to (53), we can conclude that the assumptions (A-i) and (A-i’) are met. This concludes the proof of Lemma 4.
The results of Corollaries 4 and 5 follow directly from Theorems 1 and 2 and Corollary 1. \(\square \)
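As an informal numerical illustration (not part of the proofs), the distance concentration exploited above for the Gaussian kernel can be checked directly: within-class squared distances cluster around \(2{\mathrm{tr}}({{\varvec{\varSigma }}}_i)\) and between-class ones around \({\mathrm{tr}}({{\varvec{\varSigma }}}_1)+{\mathrm{tr}}({{\varvec{\varSigma }}}_2)+\varDelta _{(I)}\), so the kernel values concentrate when \(\gamma \) is of the order of \({\mathrm{tr}}({{\varvec{\varSigma }}}_i)\). The means, covariances and \(\gamma \) below are hypothetical.

```python
# Illustrative check of the distance concentration behind Lemma 4 for the
# Gaussian kernel k(x, y) = exp(-||x - y||^2 / gamma).  Sigma_1 = Sigma_2 = I_d.
import numpy as np

rng = np.random.default_rng(2)
d = 4096
mu1, mu2 = np.full(d, 0.1), np.zeros(d)

x11, x12 = rng.normal(mu1, 1.0, (2, d))   # two samples from Pi_1
x21 = rng.normal(mu2, 1.0, d)             # one sample from Pi_2

delta_I = np.sum((mu1 - mu2) ** 2)        # Delta_(I) = ||mu_1 - mu_2||^2
gamma = 2.0 * d                           # gamma of the order of tr(Sigma_i)

print(np.sum((x11 - x12) ** 2) / (2 * d))            # within-class: approx 1
print(np.sum((x11 - x21) ** 2) / (2 * d + delta_I))  # between-class: approx 1
print(np.exp(-np.sum((x11 - x21) ** 2) / gamma),     # kernel value vs. its
      np.exp(-(2 * d + delta_I) / gamma))            # deterministic limit
```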
1.7 Proofs of Propositions 3 and 4
From (23), (C-iii) holds under (C-ii) and (C-v). Thus, from (44) and Lemmas 2 to 4, we conclude Proposition 4. For the proof of Proposition 3, we note that \({\mathrm{tr}}({{\varvec{\varSigma }}}_i)/\gamma \rightarrow 0\) for \(i=1,2,\) under (C-iv) from the fact that \(\varDelta _{(I)}=O(d)\). Thus, it holds that \(\psi \rightarrow 1\) and \(\gamma \eta _{i(II)}=2{\mathrm{tr}}({{\varvec{\varSigma }}}_i)+O(d^2/\gamma )\) for \(i=1,2,\) under (C-iv). In addition, from (22) it holds that \(\delta _{(II)}/\varDelta _{(II)}=\delta _{(I)}\{1+o(1)\}/\varDelta _{(I)}+o(1)\) under (C-iv). Thus, from (44) and Lemmas 3 and 4, we conclude Proposition 3. \(\square \)
1.8 Proof of Proposition 5
We assume (24). Note that \(1/\omega \rightarrow 0\) under (24). First, we consider the case when \(\limsup _{d\rightarrow \infty } \gamma _{\star }<\infty \). Then, it holds that \(F (\gamma _{\star }) = \{1+o(1)\}/\gamma _{ \star }\), so that \(\liminf _{d\rightarrow \infty } F (\gamma _{\star })>0\). Next, we consider the case when \(\gamma _{\star } \rightarrow \infty \). Let \(\nu =\omega /\gamma _{\star }\ (>0)\). Note that \(\nu =\varDelta _{(I)}/\gamma \). Then, it holds that
Let \(g(\nu )=\nu + 2\nu \exp (-\nu )/\{1-\exp (-\nu ) \}\). Note that \(g(\nu )\) is a monotonically increasing function and \(g(\nu )\rightarrow 2\) as \(\nu \rightarrow 0\), so that \( F (\gamma _{\star })=2\{1+o(1)\}/\omega =o(1)\) when \(\nu \rightarrow 0\). We can conclude the result. \(\square \)
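As a quick illustrative check of the two facts about \(g(\nu )\) used above, namely monotonicity and the limit \(g(\nu )\rightarrow 2\) as \(\nu \rightarrow 0\):

```python
# Numerical check (illustrative) of the properties of
# g(nu) = nu + 2*nu*exp(-nu)/(1 - exp(-nu)) used in the proof of Proposition 5.
import numpy as np

nu = np.linspace(1e-6, 10.0, 100_000)
g = nu + 2.0 * nu * np.exp(-nu) / (1.0 - np.exp(-nu))

print(np.all(np.diff(g) > 0))   # monotonically increasing over the grid: True
print(g[0])                     # approaches the limit 2 as nu -> 0
```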
1.9 Proof of Proposition 6
When \( \omega \le 1\), it holds that \( F (\gamma _{\star })=2\{1+o(1)\}/\omega \) under \(\gamma _\star \rightarrow \infty \). When \(\omega \le 1\) and \(\gamma _\star =1\), it holds that
from the facts that \(\exp (\omega +1)>1+(\omega +1)+(\omega +1)^2/2\ge 2+3\omega \), where the second inequality holds since the difference is \((\omega -1)^2/2\ge 0\), and that \(\exp (\omega -1) \ge \omega \). Hence, when \(\omega \le 1\), we have that \(\varDelta _{\varSigma }/\gamma _0 \in (0, \infty )\) as \(d\rightarrow \infty \). This concludes the result. \(\square \)
1.10 Proof of Proposition 7
From Proposition 1, Lemma 2 and (11), we can conclude the results. \(\square \)
1.11 Proof of Proposition 8
Recall that \(\kappa _1=(\zeta + \Vert {{\varvec{\mu }}}_1\Vert ^2)^r\), \(\kappa _2=(\zeta +{\mathrm{tr}}({{\varvec{\varSigma }}}_1)+\Vert {{\varvec{\mu }}}_1\Vert ^2)^r\), \(\kappa _3=(\zeta +\Vert {{\varvec{\mu }}}_2\Vert ^2)^r\), \(\kappa _4=(\zeta +{\mathrm{tr}}({{\varvec{\varSigma }}}_2)+\Vert {{\varvec{\mu }}}_2\Vert ^2)^r\) and \(\kappa _5=(\zeta +{{\varvec{\mu }}}_1^T{{\varvec{\mu }}}_2)^r\). From (1), we note that \({{\varvec{\mu }}}_i^T{{\varvec{\varSigma }}}_{i'}{{\varvec{\mu }}}_i\le \Vert {{\varvec{\mu }}}_i \Vert ^2 \lambda _{\max }({{\varvec{\varSigma }}}_{i'})=o(d^2)\) as \(d\rightarrow \infty \) for \(i,i'=1,2\). Then, similar to (50)–(52), for the polynomial kernel, we have that \({{{\varvec{x}}}}_{ij}^\mathrm{T}{{{\varvec{x}}}}_{ij'}=\Vert {{\varvec{\mu }}}_i\Vert ^2+o_P(d)\) for all \(j<j',\ i=1,2\), \({{{\varvec{x}}}}_{ij}^\mathrm{T}{{{\varvec{x}}}}_{ij}={\mathrm{tr}}({{\varvec{\varSigma }}}_i)+\Vert {{\varvec{\mu }}}_i\Vert ^2+o_P(d)\) for all i, j, and \({{{\varvec{x}}}}_{1j}^\mathrm{T}{{{\varvec{x}}}}_{2j'}={{\varvec{\mu }}}_1^T{{\varvec{\mu }}}_2+o_P(d)\) for all \(j,j'\), so that \(k({{{\varvec{x}}}}_{ij},{{{\varvec{x}}}}_{ij'})=\kappa _{2i-1}+o_P(d^r)\) for all \(j<j',\ i=1,2\), \(k({{{\varvec{x}}}}_{ij},{{{\varvec{x}}}}_{ij})=\kappa _{2i}+o_P(d^r)\) for all i, j, and \(k({{{\varvec{x}}}}_{1j},{{{\varvec{x}}}}_{2j'})=\kappa _{5}+o_P(d^r)\) for all \(j,j'\). Here, note that
from the fact that \((\zeta +{{\varvec{\mu }}}_1^T{{\varvec{\mu }}}_2)^r\le (\zeta + \Vert {{\varvec{\mu }}}_1\Vert ^2)^{r/2}(\zeta + \Vert {{\varvec{\mu }}}_2\Vert ^2)^{r/2} \), which follows from the Cauchy–Schwarz inequality applied to the augmented vectors \((\sqrt{\zeta },{{\varvec{\mu }}}_1^T)^T\) and \((\sqrt{\zeta },{{\varvec{\mu }}}_2^T)^T\). Then, it holds that \(\liminf _{d\rightarrow \infty }\varDelta /d^r>0\) from (39). Thus, we have (A-i). Similarly, we can conclude (A-i’). From Theorem 2, the BC-SVM (17) holds (6) for the polynomial kernel. This concludes the proof of Proposition 8. \(\square \)