Abstract
Models like support vector machines or Gaussian process regression often require positive semi-definite kernels. These kernels may be based on distance functions. While definiteness has been proven for common distances and kernels, a proof for a new kernel may require more time and effort than users who are mainly interested in practical application can afford. Furthermore, designing definite distances or kernels may be equally intricate. Alternatively, models can be adapted to handle indefinite kernels, but this may degrade the accuracy or increase the computational cost of the model. Hence, an efficient method to determine definiteness is required. We propose an empirical approach and show that both sampling and optimization with an evolutionary algorithm can be employed to determine definiteness. We provide a proof of concept with 16 different distance measures for permutations. Our approach can disprove definiteness if a respective counterexample is found, and it can estimate how likely it is to obtain indefinite kernel matrices. This yields a simple, efficient tool for deciding whether additional effort should be spent on designing or selecting a more suitable kernel or algorithm.
Notes
The package CEGO is available on CRAN at http://cran.r-project.org/package=CEGO.
References
Bader DA, Moret BM, Warnow T, Wyman SK, Yan M, Tang J, Siepel AC, Caprara A (2004) Genome rearrangements analysis under parsimony and other phylogenetic algorithms (GRAPPA) 2.0. https://www.cs.unm.edu/~moret/GRAPPA/. Accessed 16 Nov 2016
Bartz-Beielstein T, Zaefferer M (2017) Model-based methods for continuous and discrete global optimization. Appl Soft Comput 55:154–167
Berg C, Christensen JPR, Ressel P (1984) Harmonic analysis on semigroups, volume 100 of graduate texts in mathematics. Springer, New York
Beume N, Naujoks B, Emmerich M (2007) SMS-EMOA: multiobjective selection based on dominated hypervolume. Eur J Oper Res 181(3):1653–1669
Boytsov L (2011) Indexing methods for approximate dictionary searching: comparative analysis. J Exp Algorithmics 16:1–91
Burges CJ (1998) A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 2(2):121–167
Camastra F, Vinciarelli A (2008) Machine learning for audio, image and video analysis: theory and applications. Advanced information and knowledge processing. Springer, London
Campos V, Laguna M, Martí R (2005) Context-independent scatter and tabu search for permutation problems. INFORMS J Comput 17(1):111–122
Camps-Valls G, Martín-Guerrero JD, Rojo-Álvarez JL, Soria-Olivas E (2004) Fuzzy sigmoid kernel for support vector classifiers. Neurocomputing 62:501–506
Chen Y, Gupta MR, Recht B (2009) Learning kernels from indefinite similarities. In: Proceedings of the 26th annual international conference on machine learning (ICML ’09), New York, NY, USA. ACM, pp 145–152
Constantine G (1985) Lower bounds on the spectra of symmetric matrices with nonnegative entries. Linear Algebra Appl 65:171–178
Cortes C, Haffner P, Mohri M (2004) Rational kernels: theory and algorithms. J Mach Learn Res 5:1035–1062
Curriero F (2006) On the use of non-Euclidean distance measures in geostatistics. Math Geol 38(8):907–926
Deb K, Pratap A, Agarwal S, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197
Deza M, Huang T (1998) Metrics on permutations, a survey. J Comb Inf Syst Sci 23(1–4):173–185
Eiben AE, Smith JE (2003) Introduction to evolutionary computing. Springer, Berlin
Feller W (1971) An introduction to probability theory and its applications, vol 2. Wiley, Hoboken
Forrester A, Sobester A, Keane A (2008) Engineering design via surrogate modelling. Wiley, Hoboken
Gablonsky J, Kelley C (2001) A locally-biased form of the direct algorithm. J Glob Optim 21(1):27–37
Gärtner T, Lloyd J, Flach P (2003) Kernels for structured data. In: Matwin S, Sammut C (eds) Inductive logic programming, vol 2583. Lecture Notes in Computer Science. Springer, Berlin, pp 66–83
Gärtner T, Lloyd J, Flach P (2004) Kernels and distances for structured data. Mach Learn 57(3):205–232
Haussler D (1999) Convolution kernels on discrete structures. Technical report UCSC-CRL-99-10, Department of computer science, University of California at Santa Cruz
Hirschberg DS (1975) A linear space algorithm for computing maximal common subsequences. Commun ACM 18(6):341–343
Hutter F, Hoos HH, Leyton-Brown K (2011) Sequential model-based optimization for general algorithm configuration. In Proceedings of LION-5, pp 507–523
Ikramov K, Savel'eva N (2000) Conditionally definite matrices. J Math Sci 98(1):1–50
Jiao Y, Vert J.-P (2015) The Kendall and Mallows kernels for permutations. In: Proceedings of the 32nd international conference on machine learning (ICML-15), pp 1935–1944
Kendall M, Gibbons J (1990) Rank correlation methods. Oxford University Press, Oxford
Lee C (1958) Some properties of nonbinary error-correcting codes. IRE Trans Inf Theory 4(2):77–82
Li H, Jiang T (2004) A class of edit kernels for SVMs to predict translation initiation sites in eukaryotic mRNAs. In: Proceedings of the eighth annual international conference on research in computational molecular biology (RECOMB '04), New York, NY, USA. ACM, pp 262–271
Loosli G, Canu S, Ong C (2015) Learning SVM in Krein spaces. IEEE Trans Pattern Anal Mach Intell 38(6):1204–1216
Marteau P-F, Gibet S (2014) On recursive edit distance kernels with application to time series classification. IEEE Trans Neural Netw Learn Syst PP(99):1–1
Moraglio A, Kattan A (2011) Geometric generalisation of surrogate model based optimisation to combinatorial spaces. In: Proceedings of the 11th European conference on evolutionary computation in combinatorial optimization (EvoCOP'11), Berlin, Heidelberg, Germany. Springer, pp 142–154
Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press, Cambridge
Murphy KP (2012) Machine learning. MIT Press Ltd., Cambridge
Ong CS, Mary X, Canu S, Smola AJ (2004) Learning with non-positive kernels. In: Proceedings of the twenty-first international conference on machine learning (ICML ’04), New York, NY, USA. ACM, pp 81–88
Pawlik M, Augsten N (2015) Efficient computation of the tree edit distance. ACM Trans Database Syst 40(1):1–40
Pawlik M, Augsten N (2016) Tree edit distance: robust and memory-efficient. Inf Syst 56:157–173
Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. The MIT Press, Cambridge
Reeves CR (1999) Landscapes, operators and heuristic search. Ann Oper Res 86:473–490
Schiavinotto T, Stützle T (2007) A review of metrics on permutations for search landscape analysis. Comput Oper Res 34(10):3143–3153
Schleif F-M, Tino P (2015) Indefinite proximity learning: a review. Neural Comput 27(10):2039–2096
Schleif F-M, Tino P (2017) Indefinite core vector machine. Pattern Recognit 71:187–195
Schölkopf B (2001) The kernel trick for distances. In: Leen TK, Dietterich TG, Tresp V (eds) Advances in neural information processing systems, vol 13. MIT Press, Cambridge, pp 301–307
Sevaux M, Sörensen K (2005) Permutation distance measures for memetic algorithms with population management. In: Proceedings of 6th metaheuristics international conference (MIC'05), University of Vienna, pp 832–838
Singhal A (2001) Modern information retrieval: a brief overview. IEEE Bull Data Eng 24(4):35–43
Smola AJ, Ovári ZL, Williamson RC (2000) Regularization with dot-product kernels. In: Advances in neural information processing systems vol 13, Proceedings. MIT Press, pp 308–314
van der Loo MP (2014) The stringdist package for approximate string matching. R J 6(1):111–122
Vapnik VN (1998) Statistical learning theory, vol 1. Wiley, New York
Voutchkov I, Keane A, Bhaskar A, Olsen TM (2005) Weld sequence optimization: the use of surrogate models for solving sequential combinatorial problems. Comput Methods Appl Mech Eng 194(30–33):3535–3551
Wagner RA, Fischer MJ (1974) The string-to-string correction problem. J ACM 21(1):168–173
Wu G, Chang EY, Zhang Z (2005) An analysis of transformation on non-positive semidefinite similarity matrix for kernel machines. In: Proceedings of the 22nd international conference on machine learning
Zaefferer M, Bartz-Beielstein T (2016) Efficient global optimization with indefinite kernels. In: Parallel problem solving from nature-PPSN XIV. Springer, pp 69–79
Zaefferer M, Stork J, Bartz-Beielstein T (2014a) Distance measures for permutations in combinatorial efficient global optimization. In: Bartz-Beielstein T, Branke J, Filipič B, Smith J (eds) Parallel problem solving from nature-PPSN XIII. Springer, Cham, pp 373–383
Zaefferer M, Stork J, Friese M, Fischbach A, Naujoks B, Bartz-Beielstein T (2014b) Efficient global optimization for combinatorial problems. In: Proceedings of the 2014 conference on genetic and evolutionary computation (GECCO ’14), New York, NY, USA. ACM, pp 871–878
Zhan X (2006) Extremal eigenvalues of real symmetric matrices with entries in an interval. SIAM J Matrix Anal Appl 27(3):851–860
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by V. Loia.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Distance measures for permutations
In the following, we describe the distance measures employed in the experiments.
-
The Levenshtein distance is an edit distance measure:
\({d} _{Lev}(\pi ,\pi ') = edits_{\pi \rightarrow \pi '}\)
Here, \(edits_{\pi \rightarrow \pi '}\) is the minimal number of deletions, insertions, or substitutions required to transform one string (or here: permutation) \(\pi \) into another string \(\pi '\). The implementation is based on Wagner and Fischer (1974).
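To make the computation concrete, the Wagner–Fischer dynamic program can be sketched in a few lines. This is a minimal Python illustration operating on arbitrary sequences, not the implementation used in our experiments:

```python
def levenshtein(a, b):
    # Wagner-Fischer dynamic program: after processing row i, prev[j] holds
    # the minimal number of deletions, insertions, or substitutions
    # turning a[:i] into b[:j].
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n]
```

Keeping only two rows of the table yields linear space in the length of the second sequence.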
-
Swaps are transpositions of two adjacent elements. The Swap distance [also: Kendall’s Tau (Kendall and Gibbons 1990; Sevaux and Sörensen 2005) or Precedence distance (Schiavinotto and Stützle 2007)] counts the minimum number of swaps required to transform one permutation into another. For permutations, it is (Sevaux and Sörensen 2005):
$$\begin{aligned} {d} _{Swa}(\pi ,\pi ')&= \sum _{i=1}^{m} \sum _{j=1}^{m} z_{ij} ~~ \text {with}\\ z_{ij}&= \left\{ \begin{array}{l l} 1 &{} \quad \text {if } \pi _i < \pi _j ~\text {and}~ \pi '_i > \pi '_j ,\\ 0 &{} \quad \text {otherwise.} \end{array} \right. \end{aligned}$$
-
An interchange operation is the transposition of two arbitrary elements. Respectively, the Interchange (also: Cayley) distance counts the minimum number of interchanges (\(interchanges_{\pi \rightarrow \pi '}\)) required to transform one permutation into another (Schiavinotto and Stützle 2007):
\({d} _{Int}(\pi ,\pi ') = interchanges_{\pi \rightarrow \pi '}\)
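Both distances admit short direct computations. The illustrative Python sketches below (permutations given as sequences of the values 1 to m) count discordant index pairs and permutation cycles, respectively; the cycle-based computation uses the standard fact that the minimum number of transpositions equals m minus the number of cycles of the permutation mapping one permutation onto the other:

```python
def swap_distance(p, q):
    # Kendall's tau: count index pairs whose relative order
    # differs between the two permutations (discordant pairs).
    m = len(p)
    return sum(1 for i in range(m) for j in range(m)
               if p[i] < p[j] and q[i] > q[j])

def interchange_distance(p, q):
    # Cayley distance: m minus the number of cycles of the
    # permutation that maps p onto q.
    m = len(p)
    pos = {v: i for i, v in enumerate(q)}  # position of each value in q
    target = [pos[v] for v in p]           # where each index of p must go
    seen, cycles = [False] * m, 0
    for i in range(m):
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = target[j]
    return m - cycles
```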
-
The Insert distance is based on the longest common subsequence \(LCSeq(\pi ,\pi ')\), i.e., the largest number of elements that appear in the same order in both permutations, possibly with interruptions. The corresponding distance is
\({d} _{Ins}(\pi ,\pi ') = m-LCSeq(\pi ,\pi ').\)
We use the algorithm described by Hirschberg (1975). The name is due to its interpretation as an edit distance measure. The corresponding edit operation is a combination of insertion and deletion. A single element is moved from one position (delete) to a new position (insert). It is also called Ulam’s distance (Schiavinotto and Stützle 2007).
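As an illustration, the Insert distance can be computed with the standard dynamic program for the longest common subsequence; the Hirschberg (1975) algorithm additionally reduces memory consumption, which this minimal Python sketch omits:

```python
def insert_distance(p, q):
    # d_Ins = m - length of the longest common subsequence of p and q.
    # prev[j] holds LCSeq(p[:i], q[:j]) after processing row i.
    m = len(p)
    prev = [0] * (m + 1)
    for x in p:
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            if x == q[j - 1]:
                cur[j] = prev[j - 1] + 1
            else:
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur
    return m - prev[m]
```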
-
The Longest Common Substring distance is based on the largest number of elements that follow each other in both permutations, without interruption. Unlike the longest common subsequence, the elements have to be adjacent. If \(LCStr(\pi ,\pi ')\) is the length of the longest common substring, the distance is
$$\begin{aligned} {d} _{LCStr}(\pi ,\pi ')= m-LCStr(\pi ,\pi '). \end{aligned}$$
-
The R-distance (Campos et al. 2005; Sevaux and Sörensen 2005) counts the number of times that one element follows another in one permutation, but not in the other. It is identical with the uni-directional adjacency distance (Reeves 1999). It is computed by
$$\begin{aligned} {d} _{R}(\pi ,\pi ')&= \sum _{i=1}^{m-1} y_i ~~ \text {with}\\ y_i&= \left\{ \begin{array}{ll} 0 &{} \quad \text {if }\exists j : \pi _i=\pi '_j ~\text {and}~ \pi _{i+1}=\pi '_{j+1} ,\\ 1 &{} \quad \text {otherwise.} \end{array} \right. \end{aligned}$$
-
The (bi-directional) Adjacency distance (Reeves 1999; Schiavinotto and Stützle 2007) counts the number of times two elements are neighbors in one, but not in the other permutation. Unlike R-distance (uni-directional), the order of the two elements does not matter. It is computed by
$$\begin{aligned} {d} _{Adj}(\pi ,\pi ')&= \sum _{i=1}^{m-1} y_i ~~ \text {with}\\ y_i&= \left\{ \begin{array}{l l} 0 &{} \quad \text {if }\exists j : \pi _i=\pi '_j ~\text {and}~ \pi _{i+1} \in \{\pi '_{j+1}, \pi '_{j-1} \},\\ 1 &{} \quad \text {otherwise.} \end{array} \right. \end{aligned}$$
-
The Position distance (Schiavinotto and Stützle 2007) is identical with the Deviation distance or Spearman’s footrule (Sevaux and Sörensen 2005), \({d} _{\text {Pos}}(\pi ,\pi ') = \sum _{k=1}^{m} |i-j | ~~\text {where}~~\pi _i = \pi '_j = k\) .
-
The non-metric Squared Position distance is Spearman’s rank correlation coefficient (Sevaux and Sörensen 2005). In contrast to the Position distance, the term \(|i-j|\) is replaced by \((i-j)^2\).
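The substring-, adjacency-, and position-based distances above translate directly into code. The following illustrative Python sketches (not the paper's implementation) follow the definitions:

```python
def lcstr_distance(p, q):
    # m minus the length of the longest common contiguous substring.
    m, best = len(p), 0
    prev = [0] * (m + 1)
    for x in p:
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            if x == q[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return m - best

def r_distance(p, q):
    # Uni-directional adjacency: count successions in p that are absent in q.
    follows = {q[j]: q[j + 1] for j in range(len(q) - 1)}
    return sum(1 for i in range(len(p) - 1)
               if follows.get(p[i]) != p[i + 1])

def adjacency_distance(p, q):
    # Bi-directional adjacency: neighborhood in q, order of the pair ignored.
    nbr = {}
    for j in range(len(q) - 1):
        nbr.setdefault(q[j], set()).add(q[j + 1])
        nbr.setdefault(q[j + 1], set()).add(q[j])
    return sum(1 for i in range(len(p) - 1)
               if p[i + 1] not in nbr.get(p[i], set()))

def position_distance(p, q, squared=False):
    # Spearman's footrule; with squared=True, the squared-position variant.
    pos_p = {v: i for i, v in enumerate(p)}
    pos_q = {v: i for i, v in enumerate(q)}
    diffs = (pos_p[k] - pos_q[k] for k in pos_p)
    return sum(d * d if squared else abs(d) for d in diffs)
```

Note how the reversal (4, 3, 2, 1) of (1, 2, 3, 4) has maximal R-distance but zero bi-directional Adjacency distance, since every pair of neighbors stays adjacent.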
-
The Hamming distance or Exact Match distance simply counts the number of unequal elements in two permutations, i.e., \({d} _{Ham}(\pi ,\pi ') = \sum _{i=1}^{m} a_i, ~~\text {where}~~ a_i = \left\{ \begin{array}{l l} 0 &{} \quad \text {if } \pi _i = \pi '_i,\\ 1 &{} \quad \text {otherwise.} \end{array} \right. \)
-
The Euclidean distance is \({d} _{Euc}(\pi ,\pi ') = \sqrt{\sum _{i=1}^{m} (\pi _i-\pi '_i)^2}\) .
-
The Manhattan distance (A-Distance, cf. (Sevaux and Sörensen 2005; Campos et al. 2005)) is \({d} _{Man}(\pi ,\pi ') = \sum _{i=1}^{m} |\pi _i-\pi '_i|\) .
-
The Chebyshev distance is \({d} _{Che}(\pi ,\pi ') = \underset{1 \le i \le m}{\max }(|\pi _i-\pi '_i|)\) .
-
For permutations, the Lee distance (Lee 1958; Deza and Huang 1998) is \({d} _{Lee}(\pi ,\pi ') = \sum _{i=1}^{m} \min (|\pi _i-\pi '_i|,m-|\pi _i-\pi '_i|)\) .
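The element-wise distances (Hamming, Euclidean, Manhattan, Chebyshev, and Lee) follow immediately from their formulas; a minimal Python sketch:

```python
import math

def hamming(p, q):
    # Number of positions with unequal elements.
    return sum(a != b for a, b in zip(p, q))

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def chebyshev(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

def lee(p, q):
    # Element-wise difference, wrapped around modulo m.
    m = len(p)
    return sum(min(abs(a - b), m - abs(a - b)) for a, b in zip(p, q))
```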
-
The non-metric Cosine distance is based on the dot product of two permutations. It is derived from the cosine similarity (Singhal 2001) of two vectors:
$$\begin{aligned} {d} _{Cos}(\pi ,\pi ') = 1 - \frac{\pi \cdot \pi '}{||\pi ||~||\pi '||}. \end{aligned}$$
-
The Lexicographic distance regards the lexicographic ordering of permutations. If the position of a permutation \(\pi \) in the lexicographic ordering of all permutations with fixed m is \(L(\pi )\), then the Lexicographic distance metric is
$$\begin{aligned} {d} _{Lex}(\pi ,\pi ') =| L(\pi ) - L(\pi ')|. \end{aligned}$$
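The rank \(L(\pi )\) can be computed without enumerating all permutations via the Lehmer code (factorial number system); a minimal Python sketch:

```python
def lex_rank(p):
    # Zero-based position of p in the lexicographic ordering of all
    # permutations of its elements, via the Lehmer code: for each index i,
    # count the later elements smaller than p[i] and weight by a factorial.
    m, rank, fact = len(p), 0, 1
    for i in range(m - 1, -1, -1):
        smaller = sum(1 for j in range(i + 1, m) if p[j] < p[i])
        rank += smaller * fact
        fact *= m - i
    return rank

def lex_distance(p, q):
    return abs(lex_rank(p) - lex_rank(q))
```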
Appendix B: Minimal examples for indefinite sets
To showcase the usefulness of the proposed methods, this section lists small example datasets and the respective indefinite distance matrices. Besides the standard permutation distances we also tested:
-
Signed permutations, reversal distance Permutations in which each element has a sign are referred to as signed permutations. An application example is weld path optimization (Voutchkov et al. 2005). The reversal distance counts the number of reversals required to transform one permutation into another. We used the non-cyclic reversal distance provided by the GRAPPA library version 2.0 (Bader et al. 2004).
-
Labeled trees, tree edit distance Trees are widely used as a solution representation, e.g., in Genetic Programming. In this study, we considered labeled trees. The tree edit distance counts the number of node insertions, deletions, or relabelings. We used the efficient implementation in the APTED 0.1.1 library (Pawlik and Augsten 2015, 2016). The labeled trees are denoted in bracket notation: curly brackets indicate the tree structure, letters indicate labels (internal and terminal nodes).
-
Strings, Optimal String Alignment distance (OSA) The OSA is a non-metric edit distance that counts insertions, deletions, substitutions, and transpositions of characters, where each substring may be edited no more than once. It is also called the restricted Damerau–Levenshtein distance (Boytsov 2011). We used the implementation in the stringdist R-package (van der Loo 2014).
-
Strings, Jaro–Winkler distance The Jaro–Winkler distance is based on the number of matching characters in two strings, as well as the number of transpositions required to bring all matches into the same order. We used the implementation in the stringdist R-package (van der Loo 2014).
The respective results are listed in Table 3. All of the listed distance measures are shown to be non-CNSD.
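Probing such example sets amounts to checking whether their distance matrices are conditionally negative semi-definite (CNSD). As an illustrative sketch (using numpy, not the CEGO implementation), such a check can exploit the classical result (cf. Berg et al. 1984) that a symmetric distance matrix \(D\) with zero diagonal is CNSD if and only if the double-centered matrix \(-\frac{1}{2} J D J\), with \(J = I - \frac{1}{n}\mathbf {1}\mathbf {1}^T\), is positive semi-definite:

```python
import numpy as np

def is_cnsd(D, tol=1e-10):
    # D (symmetric, zero diagonal) is CNSD iff -1/2 * J D J is positive
    # semi-definite, where J = I - (1/n) * ones is the centering matrix.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    K = -0.5 * J @ D @ J
    return np.linalg.eigvalsh(K).min() >= -tol
```

Sampling-based probing then generates many small sets of solutions, computes their distance matrices, and records how often this check fails; a single failure already disproves definiteness of the corresponding distance-based kernel.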
Zaefferer, M., Bartz-Beielstein, T. & Rudolph, G. An empirical approach for probing the definiteness of kernels. Soft Comput 23, 10939–10952 (2019). https://doi.org/10.1007/s00500-018-3648-1