Abstract
In this paper, we consider the problem of classifying high-dimensional queries to high-dimensional classes from discrete alphabets, where the probabilistic model that relates data to the classes is known. This problem has applications in various fields, including the database search problem in mass spectrometry. The problem is analogous to the nearest neighbor search problem, where the goal is to find the data point in a database that is most similar to a query point. The state-of-the-art method for solving an approximate version of the nearest neighbor search problem in high dimensions is locality sensitive hashing (LSH). LSH is based on designing hash functions that map near points to the same buckets with a probability higher than random (far) points. To solve our high-dimensional classification problem, we introduce distribution sensitive hashes that map jointly generated pairs to the same bucket with probability higher than random pairs. We design distribution sensitive hashes using a forest of decision trees, and we analytically derive the complexity of search. We further show that the proposed hashes perform faster than state-of-the-art approximate nearest neighbor search methods for a range of probability distributions, both in theory and in simulations. Finally, we apply our method to the spectral library search problem in mass spectrometry, and show that it is an order of magnitude faster than state-of-the-art methods.
Notes
Here, we assume that \({\mathbb {P}}(x)\) is also factorizable to i.i.d. components.
The curse of dimensionality holds for non-deterministic probability distributions. When \(p(y\mid x)\) is deterministic, i.e., when it takes only a single value with probability one, there is no curse of dimensionality. In this paper, we are interested in the case of non-deterministic probability distributions.
Note that \(d(x,y)\le R\) is equivalent to x and y differing in at most R coordinates.
In fact, to control the complexity, two more conditions are defined in Definition 2. \({{\mathbb {P}}(x,y)}\) is a joint probability distribution, while \({{\mathbb {P}}^{{\mathcal {A}}}(x)}\) and \({{\mathbb {P}}^{{\mathcal {B}}}(y)}\) are its marginal probability distributions. \({{\mathbb {Q}}(x,y)}\) is defined as \({{\mathbb {Q}}(x,y)}={{\mathbb {P}}^{{\mathcal {A}}}(x)}{{\mathbb {P}}^{{\mathcal {B}}}(y)}\).
If for any node the constraints for accepting as a bucket and pruning hold simultaneously, the algorithm accepts the node as a bucket.
In MinHash and LSH-Hamming, we start with \(\#bands\times \#rows\) randomly selected hashes from the family of distribution sensitive hashes, where \(\#rows\) is the number of rows and \(\#bands\) is the number of bands. We recall a pair (x, y), if x and y are hashed to the same value in all \(\#rows\) rows and in at least one of the \(\#bands\) bands.
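The recall rule described in this note can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function names and the flat-list layout of hash values are our own assumptions.

```python
def recall_pair(hx, hy, n_bands, n_rows):
    """Return True iff x and y hash identically in all n_rows rows of at
    least one of the n_bands bands.  hx and hy are flat lists of
    n_bands * n_rows hash values, one per (band, row)."""
    for b in range(n_bands):
        band = slice(b * n_rows, (b + 1) * n_rows)
        if hx[band] == hy[band]:
            return True
    return False

def recall_probability(p, n_bands, n_rows):
    """If each hash collides independently with probability p, the pair
    is recalled with probability 1 - (1 - p**n_rows)**n_bands."""
    return 1 - (1 - p ** n_rows) ** n_bands
```

For example, with two bands of two rows each, a pair agreeing on the first band is recalled even if it disagrees on the second.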
\(\log _b Rank\) of a peak is defined as the \(\log _b\) of its rank. For any natural number n, \(\log _b Rank=n\) for the peaks at rank \(\{b^{n-1},\ldots ,b^{n}-1\}\), e.g., \(\log _2 Rank(m)=3\) for \( m \in \{4,5,6,7\}\). Joint probability distribution of logRanks for the data from Frank et al. (2011) is shown in Fig. 10a (\(4\times 4\) data matrix is obtained using \(\log _4 Rank\)), b (\(8\times 8\) data matrix is obtained using \(\log _2 Rank\)) and c (\(51\times 51\) data matrix is obtained not using any \(\log _b Rank\)).
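The \(\log _b Rank\) binning described here is easy to compute with integer arithmetic. The sketch below (function name ours) returns the n such that the rank lies in \(\{b^{n-1},\ldots ,b^{n}-1\}\); integer division avoids floating-point edge cases at exact powers of b.

```python
def log_rank(rank, b):
    """log_b Rank of a peak: the n with b**(n-1) <= rank <= b**n - 1."""
    assert rank >= 1 and b >= 2
    n = 1
    while rank >= b:
        rank //= b  # integer division keeps exact powers of b correct
        n += 1
    return n
```

As in the footnote's example, \(\log _2 Rank(m)=3\) for \(m\in \{4,5,6,7\}\).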
For any natural number n, \({\mathbb {R}}^{n+}\) denotes the set of all n-tuples of non-negative real numbers.
Note that \(V_b(G')\) satisfies the bucket-list property, i.e., there is no bucket in the tree that is an ancestor of another bucket.
Recall that, for simplicity we use the notation \(\sum _{i,j}\) and \(\prod _{i,j}\) instead of \(\sum _{1\le i\le k,1\le j\le l}\) and \(\prod _{1\le i\le k,1\le j\le l}\), respectively.
Note that whenever \(q_{ij}\) is zero, the definition of \(q_{ij}\) implies that \(p_{ij}\) is also zero. Therefore, we ignore those branches during the tree construction.
Note that \(f_1(n)=\frac{1}{n}\), \(f_2(r)=r\log r\) and \(f_3(r)=ar\) are convex functions. Therefore, as a sum of convex functions is convex, the optimization problem (103)–(107) is in the form of the convex optimization problem
$$\begin{aligned} \underset{x}{\text{ Minimize }}~~f(x),&\end{aligned}$$(137)$$\begin{aligned} {\text{ subject } \text{ to }}~~g_i(x)&\le 0, i\in \{1,\ldots ,m\}, \end{aligned}$$(138)$$\begin{aligned} h_j(x)&=0, j\in \{1,\ldots ,p\}, \end{aligned}$$(139)where \(x\in {\mathbb {R}}^n\), f(x) and \(g_i(x)\) are convex functions and \(h_j(x)\) are affine functions.
References
Aebersold R, Mann M (2003) Mass spectrometry-based proteomics. Nature 422(6928):198–207
Anagnostopoulos E, Emiris IZ, Psarros I (2015) Low-quality dimension reduction and high-dimensional approximate nearest neighbor. In: 31st international symposium on computational geometry (SoCG 2015), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik
Andoni A, Indyk P, Laarhoven T, Razenshteyn I, Schmidt L (2015) Practical and optimal LSH for angular distance. In: Advances in neural information processing systems, pp 1225–1233
Andoni A, Laarhoven T, Razenshteyn I, Waingarten E (2017) Optimal hashing-based time-space trade-offs for approximate near neighbors. In: Proceedings of the twenty-eighth annual ACM-SIAM symposium on discrete algorithms. SIAM, pp 47–66
Andoni A, Naor A, Nikolov A, Razenshteyn I, Waingarten E (2018) Data-dependent hashing via nonlinear spectral gaps. In: Proceedings of the 50th annual ACM SIGACT symposium on theory of computing, pp 787–800
Andoni A, Razenshteyn I (2015) Optimal data-dependent hashing for approximate near neighbors. In: Proceedings of the forty-seventh annual ACM symposium on theory of computing, pp 793–801
Bawa M, Condie T, Ganesan P (2005) LSH forest: self-tuning indexes for similarity search. In: Proceedings of the 14th international conference on world wide web, pp 651–660
Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18(9):509–517
Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: International conference on database theory. Springer, pp 217–235
Bhatia K, Jain H, Kar P, Varma M, Jain P (2015) Sparse local embeddings for extreme multi-label classification. In: Advances in neural information processing systems, pp 730–738
Castelli V, Li CS, Thomasian A (2000) Searching multidimensional indexes using associated clustering and dimension reduction information. U.S. Patent No. 6,134,541
Chakrabarti A, Regev O (2010) An optimal randomized cell probe lower bound for approximate nearest neighbor searching. Soc Ind Appl Math SIAM J Comput 39(5):1919–1940
Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thirty-fourth annual ACM symposium on theory of computing, pp 380–388
Choromanska AE, Langford J (2015) Logarithmic time online multiclass prediction. In: Advances in neural information processing systems, pp 55–63
Christiani T, Pagh R (2017) Set similarity search beyond MinHash. In: Proceedings of the 49th annual ACM SIGACT symposium on theory of computing, pp 1094–1107
Dasarathy BV, Sheela BV (1977) Visiting nearest neighbors—a survey of nearest neighbor pattern classification techniques. In: Proceedings of the international conference on cybernetics and society, pp 630–636
Dubiner M (2010) Bucketing coding and information theory for the statistical high-dimensional nearest-neighbor problem. IEEE Trans Inf Theory 56(8):4166–4179
Dubiner M (2012) A heterogeneous high-dimensional approximate nearest neighbor algorithm. IEEE Trans Inf Theory 58(10):6646–6658
Duda RO, Hart PE, Stork DG (1973) Pattern classification and scene analysis, vol 3. Wiley, New York
Frank AM, Monroe ME, Shah AR, Carver JJ, Bandeira N, Moore RJ, Anderson GA, Smith RD, Pevzner PA (2011) Spectral archives: extending spectral libraries to analyze both identified and unidentified spectra. Nat Methods 8(7):587–591
Friedman JH, Bentley JL, Finkel RA (1977) An algorithm for finding best matches in logarithmic expected time. ACM Trans Math Softw TOMS 3(3):209–226
Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: International conference on very large data bases, VLDB, vol 99, pp 518–529
Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of the 1984 ACM SIGMOD international conference on management of data, pp 47–57
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual ACM symposium on theory of computing, pp 604–613
Jain H, Prabhu Y, Varma M (2016) Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 935–944
Kim S, Pevzner PA (2014) MS-GF+ makes progress towards a universal database search tool for proteomics. Nat Commun 5:5277
Krizhevsky A (2009) Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto
Liu W, Tsang IW (2017) Making decision trees feasible in ultrahigh feature and label dimensions. J Mach Learn Res 18(1):2814–2849
McDonald D, Hyde E, Debelius JW, Morton JT, Gonzalez A, Ackermann G, Aksenov AA, Behsaz B, Brennan C, Chen Y, Goldasich LD (2018) American Gut: an open platform for citizen science microbiome research. Msystems 3(3):e00031-18
Miltersen PB (1999) Cell probe complexity-a survey. In: Proceedings of the 19th conference on the foundations of software technology and theoretical computer science, advances in data structures workshop, p 2
Min R (2005) A non-linear dimensionality reduction method for improving nearest neighbour classification. University of Toronto, Toronto
Mori G, Belongie S, Malik J (2001) Shape contexts enable efficient retrieval of similar shapes. In: Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition. CVPR 2001, IEEE, vol 1, pp I
Nam J, Mencía EL, Kim HJ, Fürnkranz J (2017) Maximizing subset accuracy with recurrent neural networks in multi-label classification. In: Advances in neural information processing systems, pp 5413–5423
Niculescu-Mizil A, Abbasnejad E (2017) Label filters for large scale multilabel classification. In: Artificial intelligence and statistics, pp 1448–1457
Prabhu Y, Varma M (2014) Fastxml: a fast, accurate and stable tree-classifier for extreme multi-label learning. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 263–272
Rai P, Hu C, Henao R, Carin L (2015) Large-scale Bayesian multi-label learning via topic-based label embeddings. In: Advances in neural information processing systems, pp 3222–3230
Rubinstein A (2018) Hardness of approximate nearest neighbor search. In: Proceedings of the 50th annual ACM SIGACT symposium on theory of computing, pp 1260–1268
Shakhnarovich G, Viola P, Darrell T (2003) Fast pose estimation with parameter-sensitive hashing. In: Proceedings of the ninth IEEE international conference on computer vision. IEEE, vol 2, p 750
Shaw B, Jebara T (2009) Structure preserving embedding. In: Proceedings of the 26th annual international conference on machine learning, pp 937–944
Shrivastava A, Li P (2014) Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In: Advances in neural information processing systems, pp 2321–2329
Tagami Y (2017) Annexml: approximate nearest neighbor search for extreme multi-label classification. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp 455–464
Yen IEH, Huang X, Ravikumar P, Zhong K, Dhillon I (2016) Pd-sparse: a primal and dual sparse approach to extreme multiclass and multilabel classification. In: International conference on machine learning, pp 3069–3077
Yianilos PN (1993) Data structures and algorithms for nearest neighbor search in general metric spaces. In: Symposium on discrete algorithms, SODA, vol 93, pp 311–321
Zhou WJ, Yu Y, Zhang ML (2017) Binary linear compression for multi-label classification. In: Proceedings of the twenty-sixth international joint conference on artificial intelligence, IJCAI, pp 3546–3552
Additional information
Responsible editor: Ira Assent, Carlotta Domeniconi, Aristides Gionis, Eyke Hüllermeier.
The work of Arash Gholami Davoodi was supported by Lane fellowship and a research fellowship from Alfred P. Sloan Foundation. The work of Hosein Mohimani was supported by a research fellowship from Alfred P. Sloan Foundation and a National Institute of Health New Innovator Award DP2GM137413.
Appendices
Appendix 1: Proof of Lemmas 1 and 2
1.1 Appendix 1.1: Proof of Lemma 1
Lemma 1 is proved by induction on the depth of the node v. Assume (25) holds for any node with depth less than d, and consider a node \(w_{ij}=f(v,a_i,b_j)\) with \(depth(w_{ij})=d\), i.e., a child of a node v at depth \(d-1\). From (21), we have \(\varPhi (w_{ij})=\varPhi (v)p_{ij}\). Therefore, (25) follows by induction.
See Sect. 4 for the definition of \(perm_z\). Note that \({({perm_z}(x))}_d\) stands for the d-th entry of the vector \({({perm_z}(x))}\). (74) follows from the induction assumption for the nodes with depth less than d, and (76) is a result of the i.i.d. assumption (2), i.e., \({\mathbb {P}}(y\mid x)=\prod _{s=1}^S p(y_s\mid x_s)\), and the definition of \(H_z^{{\mathcal {A}}}(x)\) in (19). [(26)–(28)] follow similarly.
1.2 Appendix 1.2: Proof of Lemma 2
Using [(25)–(28)], constraints [(11)–(14)] hold for \(\alpha =\alpha (G)\), \(\beta =\beta (G)\), \(\gamma ^{{\mathcal {A}}}=\gamma ^{{\mathcal {A}}}(G)\) and \(\gamma ^{{\mathcal {B}}}=\gamma ^{{\mathcal {B}}}(G)\). Since no two buckets are ancestor/descendant of each other, we have
Therefore, (15) holds. This completes the proof that \(H_z^{{\mathcal {A}}}(x)\) and \(H_z^{{\mathcal {B}}}(y)\) defined in (19) and (20) are \((\alpha (G),\beta (G),\gamma ^{{\mathcal {A}}}(G),\gamma ^{{\mathcal {B}}}(G))\)-sensitive.
Appendix 2: Deriving \(\mu ^*\), \(\nu ^*\), \(\eta ^*\), \(\lambda ^*\), \(p_0\), \(q_0\) and \(\delta \) for \({\mathbb {P}}\), M and N
In this section, we present an algorithm that derives \(\mu ^*\), \(\nu ^*\), \(\eta ^*\), \(\lambda ^*\), \(p_0\), \(q_0\) and \(\delta \) for a given probability distribution \({\mathbb {P}}\), M and N.
Remark 8
In Algorithm 5, the parameters \(\mu ^*\), \(\nu ^*\), \(\eta ^*\) and \(\lambda ^*\) could also be derived using Newton's method.
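Remark 8 only alludes to Newton's method; the following generic one-dimensional Newton iteration illustrates the idea. This is a sketch with our own naming, applied to a toy equation rather than the paper's actual constraint system.

```python
def newton(f, df, x0, tol=1e-12, max_iter=100):
    """Find a root of f near x0 by Newton's iteration x <- x - f(x)/df(x)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Toy example: solve x**2 - 2 = 0 starting from x0 = 1.
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
```

In practice one would apply the same iteration to the stationarity conditions defining \(\mu ^*\), \(\nu ^*\), \(\eta ^*\) and \(\lambda ^*\).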
Appendix 3: Proof of Theorem 2
In order to prove Theorem 2, we first state the following two lemmas.
Lemma 3
The function \(f(\theta ,\theta _1,\theta _2,\theta _3)={\theta } ^ {1+\rho _1+\rho _2+\rho _3} {\theta _1} ^{-\rho _1} {\theta _2}^ {-\rho _2} {\theta _3} ^{-\rho _3}\) is a convex function on the region \((\theta ,\theta _1,\theta _2,\theta _3)\in {\mathbb {R}}^{4+}\) where \((\rho _1,\rho _2,\rho _3)\in {\mathbb {R}}^{3+}\).Footnote 8
Lemma 4
\(\sum _{v \in V_{Buckets}(G)}{\big (\varPhi (v)\big )}^{1+\mu +\nu -\eta }{\big ({\varPsi }^{{\mathcal {A}}}(v)\big )}^ {-\mu +\eta }{\big ({\varPsi }^{{\mathcal {B}}}(v)\big )}^ {-\nu +\eta }{\big (\varPsi (v)\big )}^ {-\eta } \le 1\) for any \((\mu ,\nu ,\eta )\in {\mathcal {I}}\).
The proofs of Lemmas 3 and 4 are relegated to “Appendices 3.1 and 3.2”, respectively. Consider \((\mu ^ *,\nu ^ *,\eta ^*)\) that satisfy (53). For any decision tree satisfying (34)–(38), we have:
where (79) holds due to the convexity of \(f(\theta ,\theta _1,\theta _2,\theta _3)={\theta } ^ {1+\rho _1+\rho _2+\rho _3} {\theta _1} ^{-\rho _1} {\theta _2}^ {-\rho _2} {\theta _3} ^{-\rho _3}\) in Lemma 3 and (80) follows from Lemma 4. Therefore, we have
On the other hand, using (80) and the definitions of \(\alpha (G)\), \(\beta (G)\), \(\gamma ^{{\mathcal {A}}}(G)\) and \(\gamma ^{{\mathcal {B}}}(G)\) in [(29)–(32)], we have
Therefore, from (34)–(38) and (82) we have
Therefore, we conclude Theorem 2 as follows
1.1 Appendix 3.1: Proof of Lemma 3
The Hessian matrix for \(f(\theta ,\theta _1,\theta _2,\theta _3)\) is represented as
where \(W=\frac{\rho _1+\rho _2+\rho _3}{\theta }\) and \(W_i=\frac{\rho _i}{\theta _i}\) for any \(i\in \{1,2,3\}\). In order to show that the function \(f(\theta ,\theta _1,\theta _2,\theta _3)\) is convex, it is necessary and sufficient to prove that \(H(\theta ,\theta _1,\theta _2,\theta _3)\) is positive semidefinite on \({\mathbb {R}}^{4+}\). On the other hand, positive semidefinite matrices satisfy the following two properties.
1. For any non-negative scalar a and positive semidefinite matrix M, aM is positive semidefinite.
2. For positive semidefinite matrices \(M_1\) and \(M_2\), \(M_1+M_2\) is positive semidefinite.
As \(f(\theta ,\theta _1,\theta _2,\theta _3)>0\) for any \(\theta ,\theta _1,\theta _2,\theta _3\), it is sufficient to prove that \(\frac{H(\theta ,\theta _1,\theta _2,\theta _3)}{f(\theta ,\theta _1,\theta _2,\theta _3)}\) is positive semidefinite. Define, \(M_1=\begin{bmatrix} W^2&{}WW_1&{}WW_2&{}WW_3\\ W_1W&{}W_1^2&{}W_1W_2&{}W_1W_3\\ W_2W&{}W_2W_1&{}W_2^2&{}W_2W_3\\ W_3W&{}W_3W_1&{}W_3W_2&{}W_3^2 \end{bmatrix}\) and
\(M_2=\begin{bmatrix} \frac{\rho _1+\rho _2+\rho _3}{\theta ^2}&{}-\frac{\rho _1}{\theta \theta _1}&{}-\frac{\rho _2}{\theta \theta _2}&{}-\frac{\rho _3}{\theta \theta _3}\\ -\frac{\rho _1}{\theta \theta _1}&{}\frac{\rho _1}{\theta _1^2}&{}0&{}0\\ -\frac{\rho _2}{\theta \theta _2}&{}0&{}\frac{\rho _2}{\theta _2^2}&{}0\\ -\frac{\rho _3}{\theta \theta _3}&{}0&{}0&{}\frac{\rho _3}{\theta _3^2} \end{bmatrix}\). Therefore, we have \(M_1+M_2=\frac{H(\theta ,\theta _1,\theta _2,\theta _3)}{f(\theta ,\theta _1,\theta _2,\theta _3)}\). In order to prove that \(\frac{H(\theta ,\theta _1,\theta _2,\theta _3)}{f(\theta ,\theta _1,\theta _2,\theta _3)}\) is positive semidefinite, it is sufficient to prove that \(M_1\) and \(M_2\) are positive semidefinite. This holds since for any non-zero vector \(z=\begin{bmatrix}a&b&c&d\end{bmatrix}\), we have \(zM_1z^T\ge 0\) and \(zM_2z^T\ge 0\), i.e.,
where (91) is concluded as \(\rho _1,\rho _2,\rho _3\ge 0\).
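The two quadratic forms above can also be spot-checked numerically: \(M_1\) is the outer product of \((W,W_1,W_2,W_3)\), and \(zM_2z^T\) expands into a sum of squares \(\sum _i\rho _i{(\frac{a}{\theta }-\frac{z_i}{\theta _i})}^2\), so both forms should be non-negative. A minimal sketch (stdlib only, names ours):

```python
import random

def quad_form(M, z):
    """Evaluate z M z^T for a 4x4 matrix M and 4-vector z."""
    return sum(z[i] * M[i][j] * z[j] for i in range(4) for j in range(4))

def build_M1_M2(theta, th, rho):
    """Construct M1 = w w^T and M2 as defined in the proof of Lemma 3."""
    w = [sum(rho) / theta] + [rho[i] / th[i] for i in range(3)]
    M1 = [[w[i] * w[j] for j in range(4)] for i in range(4)]
    M2 = [[0.0] * 4 for _ in range(4)]
    M2[0][0] = sum(rho) / theta ** 2
    for i in range(3):
        M2[0][i + 1] = M2[i + 1][0] = -rho[i] / (theta * th[i])
        M2[i + 1][i + 1] = rho[i] / th[i] ** 2
    return M1, M2

random.seed(0)
ok = True
for _ in range(1000):
    theta = random.uniform(0.1, 5)
    th = [random.uniform(0.1, 5) for _ in range(3)]
    rho = [random.uniform(0, 3) for _ in range(3)]
    M1, M2 = build_M1_M2(theta, th, rho)
    z = [random.uniform(-1, 1) for _ in range(4)]
    if quad_form(M1, z) < -1e-6 or quad_form(M2, z) < -1e-6:
        ok = False
```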
1.2 Appendix 3.2: Proof of Lemma 4
First, note that \(\varPsi (v)={\varPsi }^{{\mathcal {A}}}(v){\varPsi }^{{\mathcal {B}}}(v)\). Let us define
We show
by induction on the number of nodes in the tree. If the tree has only one node, i.e., the root, then (93) holds as \(\varPhi ({root})=1\), \({\varPsi }^{{\mathcal {A}}}({root})=1\) and \({\varPsi }^{{\mathcal {B}}}({root})=1\) from the definition of \(\varPhi (v)\), \({\varPsi }^{{\mathcal {A}}}(v)\) and \({\varPsi }^{{\mathcal {B}}}(v)\) in [(21)–(23)]. Assume that (93) holds for any decision tree with \(|G|<Z\). Our goal is to prove that (93) holds for a decision tree with \(|G|=Z\). Assume \(v_{11}\) is a node of maximum depth in G and consider the tree \(G'\) constructed by removing \(v_{11}\) and all its siblings \(v_{ij},1\le i\le k,1\le j\le l\) belonging to the same parent v. In other words, for the tree \(G'\) we haveFootnote 9
(95) is true as for the graph \(G'\), the node v is now a leaf node while the nodes \(v_{11},\ldots ,v_{kl}\) are removed. Then, we have
where (96) holds from the definition of the tree \(G'\), (97) follows from the recursive definition of \(\varPhi (v)\), \({\varPsi }^{{\mathcal {A}}}(v)\) and \({\varPsi }^{{\mathcal {B}}}(v)\) in [(21)–(23)], and (99) holds from the definition of \(\mu \), \(\nu \) and \(\eta \) in (52), i.e.,
Therefore, we conclude that
Note that the inequality (96) becomes an equality only in cases where none of the children are pruned.
Appendix 4: Proof of Theorem 3
In order to prove Theorem 3, we first present the following lemma.
Lemma 5
Given a fixed \(N\in {\mathbb {N}}\) and probability distribution \({\mathbb {P}}\), consider the following region \(R_{\lambda }\)
Then, \((\lambda ^*,r_{ij}^*,n^*)\) defined in Definition 4 is a member of \({\mathcal {R}}(\lambda ,r_{ij},n)\).Footnote 10
The proof of Lemma 5 is relegated to “Appendix 4.4”. Let us prove that the following tree construction steps in Algorithm 4 result in a tree that satisfies (34)–(38).
Consider the set of \(r^*_{ij} =p_{ij} ^ {1+\mu ^*+\nu ^*-\eta ^*} {(p_{i}^{{\mathcal {A}}})} ^ {-\mu ^*} {(p_j^{{\mathcal {B}}})} ^ {-\nu ^*}\) and \(n^* =\frac{(\max (1,\delta )-\lambda ^*) \log N}{\sum {r_{ij}^*\log \frac{p_{ij}}{r_{ij}^*}} }\). Note that we assume \(p_{ij}\) and \(q_{ij}\) are non-zero.Footnote 11 Consider \(n_{ij} = \lceil n^*r_{ij}^* \rceil \) if \(r^*_{ij} > \frac{1}{2}\) and \(n_{ij} = \lfloor n^*r^*_{ij} \rfloor \) if \(r^*_{ij} \le \frac{1}{2}\). Therefore, we have \(n^*-kl<\sum _{ij}n_{ij}\le n^*\). For any \(v\in V(G)\), define the set \({S}_{ij}(v)\) as follows
where depth(v) is the depth of node v in the tree, \(Seq^{{\mathcal {A}}}_s(v)\) and \(Seq^{{\mathcal {B}}}_s(v)\) stand for the character at position s in the strings \(Seq^{{\mathcal {A}}}(v)\) and \(Seq^{{\mathcal {B}}}\)(v), respectively. Now, consider a node \(v^*\) in the graph that satisfies the following constraints:
The number of nodes v that satisfy this constraint is \(\left( {\begin{array}{c}n^*\\ n_{11},\ldots ,n_{kl}\end{array}}\right) \). Moreover, define
1.1 Appendix 4.1: Node \(v^*\) or one of its ancestors is designated as a bucket by Algorithm 4
Here, we prove that the node \(v^*\) or one of its ancestors is designated as a bucket by Algorithm 4. In order to show this, we need to prove that:
where \(p_{0}\) and \(q_{0}\) are defined as \(\prod _{i,j}p_{ij}\) and \(\min (\prod _{i,j}q_{ij},\prod _{i}{(p_i^{{\mathcal {A}}})}^{l},\prod _j{(p_j^{{\mathcal {B}}})}^{k})\). Note that, \(\varPhi (v^*)\), \(\varPsi (v^*)\), \({\varPsi }^{{\mathcal {A}}}(v^*)\) and \({\varPsi }^{{\mathcal {B}}}(v^*)\) are computed as follows
Therefore, from Lemma 5 and [(11)–(77)] we conclude [(112)–(114)]. This means \(v^*\) or one of its ancestors is an accepted bucket.
1.2 Appendix 4.2: Proof of bounds [(35)–(38)]
First, we derive a lower bound on \(\alpha (G)\) as follows.
where \(V_{n_{11},\ldots ,n_{kl}}\) is the set of nodes that satisfies (110). \(|V_{n_{11},\ldots ,n_{kl}}|\) is lower bounded as
for some constant \(c=\frac{ (2\pi ) ^ {\frac{1 - kl}{2}}e ^ {-kl}}{(kl)!} {(n^*)} ^ {\frac{1 - kl}{2}}\) depending on \(n^*\), k and l. (122) is true as for any natural number m we have \(\sqrt{2\pi m}{(\frac{m}{e})}^m\le m!< \sqrt{2\pi m}{(\frac{m}{e})}^me\). (74) follows as \(a\log \frac{1}{a}\) is an increasing function for \(0\le a\le 0.5\), and a decreasing function for \(0.5\le a\le 1\). Therefore, from (115) and (121), \({\alpha (G)} \ge N^{\max (1,\delta )-\lambda ^*}\) is concluded. Similarly, (34)–(38) are proved as follows.
where (126) and (128) are concluded from (112)–(114) and the fact that \(\frac{\sum _i{a_i}}{\sum _ib_i}\ge c\) holds whenever \(\frac{a_i}{b_i}\ge c\) and \(b_i>0\) for all i.
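The two-sided Stirling bound \(\sqrt{2\pi m}{(\frac{m}{e})}^m\le m!< \sqrt{2\pi m}{(\frac{m}{e})}^me\) used in (122) can be verified numerically in log space; the sketch below uses `math.lgamma` to evaluate \(\log m!\) without overflow.

```python
import math

def stirling_log(m):
    """log of sqrt(2*pi*m) * (m/e)**m, the lower Stirling estimate."""
    return 0.5 * math.log(2 * math.pi * m) + m * (math.log(m) - 1)

# lgamma(m + 1) = log(m!); the gap to the estimate lies in (0, 1/12),
# so the two-sided bound (gap in [0, 1)) holds for every m checked.
ok = all(
    stirling_log(m) <= math.lgamma(m + 1) < stirling_log(m) + 1
    for m in range(1, 200)
)
```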
1.3 Appendix 4.3: Bounding number of nodes in the tree, i.e., (34)
The number of nodes in the decision tree defined in (57), is bounded by \(O(N^{\lambda ^*})\) using the following three lemmas.
Lemma 6
For any leaf node v of the decision tree defined in (57), we have
Lemma 7
For any tree G, the summation of \(\varPhi (v)\) over all the leaf nodes is equal to one, i.e., \(\sum _{v\in V_{l}}\varPhi (v)=1\).
Lemma 8
The number of nodes in the decision tree defined in (57) is at most two times of the number of leaf nodes.
For the proofs of Lemmas 6, 7 and 8, see “Appendices 4.5, 4.6 and 4.7”. Therefore, we have
where (130) follows from Lemma 7, (131) is true from (167) and (133) is concluded from Lemma 8. Therefore, we conclude that \(|V(G)|=O( N^{\lambda ^*})\).
1.4 Appendix 4.4: Proof of Lemma 5
Consider the optimization problem of finding the member of \((\lambda ,r_{ij},n)\in {\mathcal {R}}(\lambda ,r_{ij},n)\) with minimum \(\lambda \). This optimization problem is a convex optimization problem.Footnote 12 Therefore, writing the KKT conditions, we have
where
From (138), \(\mu _{1ij}\) is zero if \(r_{ij}\) is a non-zero number. Therefore, we only keep i and j where \(r_{ij}\ne 0\) and \(\mu _{1ij}=0\).
Consider the following two cases.
1. \(\mu _2=0\). In this case, all the constraints are affine functions; therefore, we have a linear programming problem whose feasible set is a polyhedron. From (103), the polyhedron is bounded, i.e., \(0\le r_{ij}\le M\) for some constant M. Assume that the polyhedron is nonempty; otherwise the solution is \(\infty \). Moreover, a nonempty bounded polyhedron cannot contain a line, thus it must have a basic feasible solution and the optimal solutions are restricted to the corner points.
2. \(\mu _2\ne 0\). As \(r_{ij}\ne 0\) and \(\mu _{1ij}=0\), we have
$$\begin{aligned}&\frac{d F(r_{ij},n,\lambda )}{dr_{ij}} = 0 \rightarrow \mu _2 + \mu _2\log r_{ij}+ \mu _3\log p_{i}^{{\mathcal {A}}}\nonumber \\&\qquad + \mu _4\log p_j^{{\mathcal {B}}}+ \mu _5\log q_{ij}+ \mu _6\nonumber \\&\qquad -(\mu _2+\mu _3+\mu _4+\mu _5)\log p_{ij}= 0 \end{aligned}$$(149)$$\begin{aligned}&\quad \rightarrow r_{ij} ^ {\mu _2} = p_{ij} ^ {\mu _2+\mu _3+\mu _4+\mu _5} {(p_i^{{\mathcal {A}}})} ^ {-\mu _3}{(p_j^{{\mathcal {B}}})} ^ {-\mu _4}q_{ij} ^ {-\mu _5} e ^ {-\mu _2 - \mu _6} \end{aligned}$$(150)$$\begin{aligned}&\quad \rightarrow r_{ij} =cp_{ij} ^ {\frac{\mu _2+\mu _3+\mu _4+\mu _5}{\mu _2}} {(p_i^{{\mathcal {A}}})} ^ {-\frac{\mu _3}{\mu _2}} {(p_j^{{\mathcal {B}}})} ^ {-\frac{\mu _4}{\mu _2}} q_{ij} ^ {-\frac{\mu _5}{\mu _2}}, \end{aligned}$$(151)$$\begin{aligned}&\qquad \frac{d F(r_{ij},n,\lambda )}{dn} = 0 \nonumber \\&\quad \rightarrow -\left( \mu _2(\max (1,\delta )-\lambda ) +\mu _3(1-\lambda ) +\mu _4(\delta -\lambda )\right. \nonumber \\&\qquad \left. +\mu _5(1+\delta -\lambda ) \right) \frac{\log N}{n^2}=0 \end{aligned}$$(152)$$\begin{aligned}&\quad \rightarrow \mu _2(\max (1,\delta )-\lambda ) +\mu _3(1-\lambda ) \nonumber \\&\qquad +\mu _4(\delta -\lambda )+\mu _5(1+\delta -\lambda )=0 \end{aligned}$$(153)$$\begin{aligned}&\quad \rightarrow \lambda =\frac{\mu _2\max (1,\delta )+\mu _3+\mu _4\delta +\mu _5(1+\delta )}{\mu _2+\mu _3+\mu _4+\mu _5}. \end{aligned}$$(154)Summing (125), (142), (144) and (146) and using (151), we have
$$\begin{aligned}&\left( \mu _2(\max (1,\delta )-\lambda ) +\mu _3(1-\lambda ) \right. \left. +\mu _4(\delta -\lambda )+\mu _5(1+\delta -\lambda ) \right) \frac{\log N}{n}\nonumber \\&\quad =\mu _2\log c. \end{aligned}$$(155)From (154) and \(\mu _2\ne 0\), we conclude that \(c=1\). Moreover, we have \(\lambda =\frac{\max (1,\delta )+\mu +\nu \delta }{1+\mu +\nu +\eta }\), where \(\mu \), \(\nu \) and \(\eta \) are defined as:
$$\begin{aligned} \mu= & {} \frac{\mu _3+\mu _5}{\mu _2}, \end{aligned}$$(156)$$\begin{aligned} \nu= & {} \frac{\mu _4+\mu _5}{\mu _2}, \end{aligned}$$(157)$$\begin{aligned} \eta= & {} \frac{\mu _2'-\mu _5}{\mu _2}. \end{aligned}$$(158)Assume that \(\eta >0\), therefore \(r_{ij}<p_{ij}\) as \(p_{ij}\le \min (q_i,q_j)\). On the other hand, \(\sum _{i,j}{r}_{ij}=\sum p_{ij}=1\) which contradicts our assumption that \(\eta >0\). Thus, \(\eta \le 0\). Define \((\mu ^*,\nu ^*,\eta ^*)\) as follows
$$\begin{aligned} (\mu ^*,\nu ^*,\eta ^*)&=\arg \max _{\min (\mu ,\nu )\ge \eta \ge 0,\sum _{i,j}p_{ij} ^ {1+\mu +\nu -\eta } {(p_{i}^{{\mathcal {A}}})} ^ {-\mu } {(p_j^{{\mathcal {B}}})} ^ {-\nu } = 1}\nonumber \\&\quad \frac{\max (1,\delta )+\mu +\nu \delta }{1+\mu +\nu -\eta }. \end{aligned}$$(159)Therefore, from (140), (151) and (154) we conclude Lemma 5 as follows
$$\begin{aligned} r_{ij}^*= & {} p_{ij} ^ {1+\mu ^*+\nu ^*-\eta ^*} {(p_{i}^{{\mathcal {A}}})} ^ {-\mu ^*} {(p_j^{{\mathcal {B}}})} ^ {-\nu ^*}, \end{aligned}$$(160)$$\begin{aligned} \lambda ^*= & {} \frac{\max (1,\delta )+\mu ^*+\nu ^*\delta }{1+\mu ^*+\nu ^*-\eta ^*}, \end{aligned}$$(161)$$\begin{aligned} n^*= & {} \frac{(\max (1,\delta )-\lambda ^*) \log N}{\sum r_{ij}^*\log \frac{p_{ij}}{r_{ij}^*}}. \end{aligned}$$(162)
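For small alphabets, the optimization (159) can be approximated by brute force. The sketch below is our own simplification: a coarse grid over \((\mu ,\nu ,\eta )\) with a toy \(2\times 2\) distribution, keeping only near-feasible points of the constraint and returning the largest objective value found. Since \((\mu ,\nu ,\eta )=(0,0,0)\) always satisfies the constraint, the result is at least \(\max (1,\delta )\).

```python
def lambda_star(p, delta, steps=40, tol=2e-3):
    """Coarse grid-search approximation of the maximization in (159).
    p is a k-by-l joint probability matrix."""
    pa = [sum(row) for row in p]            # marginals p_i^A
    pb = [sum(col) for col in zip(*p)]      # marginals p_j^B
    best = max(1.0, delta)                  # (mu, nu, eta) = (0, 0, 0) is feasible
    grid = [3.0 * t / steps for t in range(steps + 1)]
    for mu in grid:
        for nu in grid:
            for eta in grid:
                if eta > min(mu, nu):       # region requires min(mu, nu) >= eta
                    break
                g = sum(
                    p[i][j] ** (1 + mu + nu - eta)
                    * pa[i] ** (-mu) * pb[j] ** (-nu)
                    for i in range(len(p)) for j in range(len(p[0]))
                )
                if abs(g - 1.0) < tol:      # near the constraint surface
                    lam = (max(1.0, delta) + mu + nu * delta) / (1 + mu + nu - eta)
                    best = max(best, lam)
    return best

lam = lambda_star([[0.4, 0.1], [0.1, 0.4]], delta=1.0)
```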
1.5 Appendix 4.5: Proof of Lemma 6
In order to prove Lemma 6, consider a leaf node \(v_0\) and its parent \(v_1\). From (57), for the node \(v_1\) we have
(163)–(165) follow from (57) and the fact that the node \(v_1\) is neither pruned nor accepted as it is not a leaf node. Therefore, from (164)\(\times \)(165)/(163), we conclude that,
Therefore, from definition of \(\varPhi (v)\), for the leaf node \(v_0\) we have
(167) is true for all the leaf nodes as (163)–(167) were derived independent of the choice of the leaf node.
1.6 Appendix 4.6: Proof of Lemma 7
The proof is straightforward based on the proof of Lemma 4. \(\sum _{v\in V_{l}}\varPhi (v)=1\) is proved by induction on the depth of the tree. For a tree G with \(depth(G)=1\), \(\sum _{v\in V_{l}}\varPhi (v)=1\) is trivial as for the children \(v_{ij}\) of the root we have \(\varPhi (v_{ij})=p_{ij}\) and \(\sum _{ij}p_{ij}=1\). Assume that \(\sum _{v\in V_{l}}\varPhi (v)=1\) is true for all trees G with \(depth(G)\le depth\). Our goal is to prove that it is true for all trees G with \(depth(G)= depth+1\). Consider a tree G with \(depth(G)= depth+1\) and the tree \(G'\) obtained by removing all the nodes at depth \( depth+1\).
(168) is a result of the definition of \(G'\), i.e., \(G'\) is obtained from G by removing all the nodes at depth \( depth+1\). (169) is true as for any node w and its children \(v_{ij}\) we have \(\varPhi (w)=\sum _{ij} \varPhi (v_{ij})\), which is a result of the fact that \(\varPhi (v_{ij})=\varPhi (w)p_{ij}\). (170) is concluded from the induction assumption, i.e., \(\sum _{v\in V_{l}}\varPhi (v)=1\) is true for all trees G with \(depth(G)\le depth\).
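Lemma 7 can be sanity-checked by simulation: in any tree where each internal node splits its mass \(\varPhi \) according to \(p_{ij}\), the leaf masses must sum to one. A minimal sketch (the random stopping rule and names are ours):

```python
import random

def leaf_mass(p, depth, phi=1.0):
    """Sum of Phi over the leaves of a random tree in which every
    internal node branches into one child per (i, j) with
    Phi(child) = Phi(node) * p[i][j]."""
    if depth == 0 or random.random() < 0.4:   # leaf: stop expanding
        return phi
    return sum(
        leaf_mass(p, depth - 1, phi * p[i][j])
        for i in range(len(p)) for j in range(len(p[0]))
    )

random.seed(1)
total = leaf_mass([[0.25, 0.25], [0.25, 0.25]], depth=6)
```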
1.7 Appendix 4.7: Proof of Lemma 8
For any decision tree in which each node is either pruned, accepted or branched into kl children, the number of nodes in the tree is at most twice the number of leaf nodes, i.e., \(|V(G)|\le 2|V_l(G)|\). This is proved by induction on the depth of the tree. For a tree G with \(depth(G)=1\), we have \(|V_l(G)|=kl\) and \(|V(G)|=kl+1\); therefore, Lemma 8 is true in this case. Assume that \(|V(G)|\le 2|V_l(G)|\) is true for all trees G with \(depth(G)\le depth\). Our goal is to prove that it is true for all trees G with \(depth(G)= depth+1\). Consider a tree G with \(depth(G)=depth+1\), and the tree \(G'\) obtained by removing all the nodes v where \(depth(v)=depth+1\). Assume there are klr of them (each intermediate node has kl children). Therefore, we have \(|V(G)|=|V(G')|+klr\), \(|V_l(G)|=|V_l(G')|+(kl-1)r\) and \(|V(G')|\le 2|V_l(G')|\). This results in \(|V(G)|\le 2|V_l(G)|\).
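The counting argument of Lemma 8 can likewise be checked on random trees in which every internal node has exactly kl children (a sketch, names ours):

```python
import random

def count_nodes(k, l, depth, rng):
    """Return (total_nodes, leaf_nodes) of a random tree in which each
    node is either a leaf or branches into k * l children."""
    if depth == 0 or rng.random() < 0.5:
        return 1, 1
    total, leaves = 1, 0
    for _ in range(k * l):
        t, lf = count_nodes(k, l, depth - 1, rng)
        total += t
        leaves += lf
    return total, leaves

rng = random.Random(0)
ok = True
for _ in range(200):
    total, leaves = count_nodes(2, 2, 6, rng)
    if total > 2 * leaves:   # Lemma 8: |V(G)| <= 2 |V_l(G)|
        ok = False
```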
Appendix 5: Proof of Theorem 4
First, note that from \(\frac{p_{ij}}{1+\epsilon }\le p'_{ij}\le {p_{ij}}{(1+\epsilon )}\), we conclude that \(\frac{p_{i}^{{\mathcal {A}}}}{1+\epsilon }\le {p'}_{i}^{{\mathcal {A}}}\le {{p}_{i}^{{\mathcal {A}}}}{(1+\epsilon )}\), \(\frac{p_j^{{\mathcal {B}}}}{1+\epsilon }\le {p'}_{j}^{{\mathcal {B}}}\le {p_j^{{\mathcal {B}}}}{(1+\epsilon )}\) and \(\frac{q_{ij}}{{{(1+\epsilon )}^2}}\le q'_{ij}\le {q_{ij}}{{{(1+\epsilon )}^2}}\). Assume that \(depth(G)=d\). Define the variables \(\varPhi (v)\), \({\varPsi }^{{\mathcal {A}}}(v)\), \({\varPsi }^{{\mathcal {B}}}(v)\), \(\varPsi (v)\), \(\alpha (G)\), \(\gamma ^{{\mathcal {A}}}(G)\), \(\gamma ^{{\mathcal {B}}}(G)\), \(\beta (G)\) and TP for the tree with the distribution \(p'\) as \(\varPhi '(v)\), \(\varPsi '^{{\mathcal {A}}}(v)\), \(\varPsi '^{{\mathcal {B}}}(v)\), \(\varPsi '(v)\), \(\alpha '(G)\), \(\gamma '^{{\mathcal {A}}}(G)\), \(\gamma '^{{\mathcal {B}}}(G)\), \(\beta '(G)\) and \(TP'\). Therefore, from (29)–(32) we have
(172) follows from \(\varPhi (v)\le \varPhi '(v)(1+\epsilon )^d\) which is a result of the definition of \(\varPhi (v)\) in (22), i.e., \(\varPhi (f(v,a_i,b_j))=\varPhi (v)p_{ij},\forall v\in V\). Similarly, we have
On the other hand, from (48)–(51), we have
Using the inequality \((1-x)^{\frac{c}{x}}< e^{-c}\), the minimum number of bands \(\#bands\) needed to guarantee a true positive rate of TP can be computed as
Thus, the total complexity is computed as
where (181) follows from (174)–(177) and the fact that the total complexity for the distribution \(p'\) is \(O(N^{\lambda ^{*}(p')})\). Finally, (182) is obtained from the following lemma, in which we prove that the depth of the tree is bounded by \(c_d\log N\), where \(c_d\) is a constant depending on the distribution.
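The band calculation above can be instantiated numerically. Assuming a per-band collision probability \(\alpha\) for a true pair (the values of \(\alpha\) and TP below are illustrative), the smallest integer number of bands ensuring an overall true positive rate of at least TP follows the standard LSH calculation:

```python
import math

def min_bands(alpha, tp):
    # A true pair is missed by all bands with probability (1 - alpha)^bands,
    # so we need 1 - (1 - alpha)^bands >= tp.
    return math.ceil(math.log(1 - tp) / math.log(1 - alpha))

bands = min_bands(alpha=0.05, tp=0.99)
assert 1 - (1 - 0.05) ** bands >= 0.99   # achieved true positive rate
print(bands)                              # -> 90
```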
Lemma 9
For the decision tree G, \(depth(G)\le c_d\log N\) where \(c_d=\min \big (\frac{{(\lambda ^*-1)}}{\log (\max _{i,j}\frac{p_{ij}}{p_i^{{\mathcal {A}}}})},\frac{{(\lambda ^*-\delta )}}{\log (\max _{i,j}\frac{p_{ij}}{p_j^{{\mathcal {B}}}})}\big )\).
For the proof of Lemma 9, see “Appendix 5.1”.
1.1 Appendix 5.1: Proof of Lemma 9
From (57), for the decision tree G we have
Our goal is to prove that for any pruned or accepted node v, \(depth(v)\le c_d\log N\), where \(c_d\) is a constant depending on the distribution. Consider a pruned or accepted node v. For its parent w, we have
Therefore, we conclude that
Similarly, we have
Thus, \(c_d\) is defined as \( \min \big (\frac{{(\lambda ^*-1)}}{\log (\max _{i,j}\frac{p_{ij}}{p_i^{{\mathcal {A}}}})},\frac{{(\lambda ^*-\delta )}}{\log (\max _{i,j}\frac{p_{ij}}{p_j^{{\mathcal {B}}}})}\big )\). (185) and (187) hold since \(p_0,q_0\le 1\).
Appendix 6: Pseudo code
Here, we present the pseudo code to compute the complexity of the algorithm of Dubiner (2012) in the case of Hamming distance (see Experiment 1).
Appendix 7: Further discussion on MIPS
In order to use MIPS to solve this problem, i.e., (2), we need to derive the optimal weights \(\omega _{ij}\) minimizing the norm \(M^2\) in Shrivastava and Li (2014). Here, M stands for the radius of the space, computed as \(M^2={\mathbb {E}}{\big (||x||\big )}^2+{\mathbb {E}}{\big (||y||\big )}^2\). Therefore, from (69)–(72) we conclude that \(M^2=\sum _{ij}\left( p_j^{{\mathcal {B}}}\omega _{ij}^2+\frac{p_i^{{\mathcal {A}}}{\log ^2(\frac{p_{ij}}{q_{ij}})}}{\omega _{ij}^2}\right) \), which is minimized by \(\omega _{ij}={\big (\frac{p_i^{{\mathcal {A}}}}{p_j^{{\mathcal {B}}}}\big )}^{0.25}{\big (\mid \log \frac{p_{ij}}{q_{ij}}\mid \big )}^{0.5}\). On the other hand, for \((x,y)\sim Q(x,y)\) we have
In order to achieve a true positive rate of nearly one with sub-quadratic complexity, we need \(S_0\le S d_{KL}(p_{ij}||q_{ij})\) and \(cS_0\ge -S d_{KL}(q_{ij}||p_{ij})\), where \(d_{KL}\) stands for the Kullback–Leibler divergence. Moreover, we should have \(M^2\ge S\sum _{ij}\sqrt{q_{ij}}|\log (\frac{p_{ij}}{q_{ij}})|\). Setting \(c=0\), and \(S_0\) and M as above, the complexity exponent exceeds 1.9 for every \(2\times 2\) probability distribution matrix. The reason is that the transformed data points are nearly orthogonal to each other, which makes finding the maximum inner product very slow with the existing method (Shrivastava and Li 2014).
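The optimality of \(\omega_{ij}\) above can be checked term by term: each summand of \(M^2\) has the form \(a\omega ^2+b/\omega ^2\) with \(a=p_j^{{\mathcal {B}}}\) and \(b=p_i^{{\mathcal {A}}}\log ^2(p_{ij}/q_{ij})\), which is minimized at \(\omega =(b/a)^{1/4}\). A minimal numerical sketch (the entry and marginal values below are illustrative):

```python
import math

# One term of M^2: f(w) = a*w**2 + b/w**2, minimized at w = (b/a)**0.25.
p_ij, q_ij = 0.345, 0.225975          # illustrative joint and product entries
pA_i, pB_j = 0.345, 0.655             # illustrative marginals
a = pB_j
b = pA_i * math.log(p_ij / q_ij) ** 2
w_opt = (pA_i / pB_j) ** 0.25 * abs(math.log(p_ij / q_ij)) ** 0.5

f = lambda w: a * w * w + b / (w * w)
# Perturbing the weight in either direction can only increase the term.
assert f(w_opt) <= f(w_opt * 1.01) and f(w_opt) <= f(w_opt * 0.99)
print(round(w_opt, 4))
```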
Appendix 8: Complexities of MinHash, LSH-hamming and ForestDSH
In this section, we derive the complexities of MinHash, LSH-hamming and ForestDSH in the case of \({P}_1=\begin{bmatrix} 0.345 & 0 \\ 0.31 & 0.345 \end{bmatrix}\). The complexities for any other \(2\times 2\) probability distribution can be computed similarly.
1.1 Appendix 8.1: Complexity of MinHash
For MinHash the query complexity is
where \(mh_1={\frac{\log \frac{p_{00}}{1-p_{11}}}{\log \frac{q_{00}}{1-q_{11}}}}\), \(mh_2={\frac{\log \frac{p_{01}}{1-p_{10}}}{\log \frac{q_{01}}{1-q_{10}}}}\), \(mh_3={\frac{\log \frac{p_{10}}{1-p_{01}}}{\log \frac{q_{10}}{1-q_{01}}}}\) and \(mh_4={\frac{\log \frac{p_{11}}{1-p_{00}}}{\log \frac{q_{11}}{1-q_{00}}}}\). For \({P}_1\), the per-query complexity evaluates to 0.5207.
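These ratios can be evaluated directly for \(P_1\), taking \(q_{ij}=p_i^{{\mathcal {A}}}p_j^{{\mathcal {B}}}\); reading the per-query exponent as the smallest of the four ratios reproduces the value 0.5207 (this reading is an assumption consistent with the reported number, since the displayed complexity formula is omitted here):

```python
import math

p = [[0.345, 0.0], [0.31, 0.345]]                     # P_1
pA = [sum(row) for row in p]                          # row marginals p_i^A
pB = [sum(col) for col in zip(*p)]                    # column marginals p_j^B
q = [[pA[i] * pB[j] for j in range(2)] for i in range(2)]

def ratio(pn, pd, qn, qd):
    # log(pn / (1 - pd)) / log(qn / (1 - qd)); infinite when pn = 0
    if pn == 0.0:
        return math.inf
    return math.log(pn / (1 - pd)) / math.log(qn / (1 - qd))

mh = [ratio(p[0][0], p[1][1], q[0][0], q[1][1]),      # mh_1
      ratio(p[0][1], p[1][0], q[0][1], q[1][0]),      # mh_2
      ratio(p[1][0], p[0][1], q[1][0], q[0][1]),      # mh_3
      ratio(p[1][1], p[0][0], q[1][1], q[0][0])]      # mh_4
print(round(min(mh), 4))                              # -> 0.5207 for P_1
```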
1.2 Appendix 8.2: Complexity of LSH-hamming
In the case of LSH-hamming, the query complexity is
and the storage required by the algorithm is \(O(N^{1+\min (\frac{\log (p_{00}+p_{11})}{\log (q_{00}+q_{11})},\frac{\log (p_{01}+p_{10})}{\log (q_{01}+q_{10})})})\). Similarly, for \({P}_1\), the per-query complexity evaluates to 0.4672.
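The same marginals give the LSH-hamming exponent: evaluating the minimum of the two ratios appearing in the storage bound reproduces the reported 0.4672 (again with \(q_{ij}=p_i^{{\mathcal {A}}}p_j^{{\mathcal {B}}}\)):

```python
import math

p = [[0.345, 0.0], [0.31, 0.345]]                     # P_1
pA = [sum(row) for row in p]                          # row marginals p_i^A
pB = [sum(col) for col in zip(*p)]                    # column marginals p_j^B
q = [[pA[i] * pB[j] for j in range(2)] for i in range(2)]

# Ratio for matching coordinates (p00 + p11) and mismatching ones (p01 + p10).
match = math.log(p[0][0] + p[1][1]) / math.log(q[0][0] + q[1][1])
mismatch = math.log(p[0][1] + p[1][0]) / math.log(q[0][1] + q[1][0])
print(round(min(match, mismatch), 4))                 # -> 0.4672 for P_1
```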
1.3 Appendix 8.3: Complexity of ForestDSH
From Definition 4, we derive \(\lambda ^*\) as follows
Note that \(\delta =1\), and the per-query complexity is equal to 0.4384.
Appendix 9: Joint probability distributions learned on mass spectrometry data
The mass spectrometry data for Experiment 4 are shown in Fig. 10a–c for \(\log Rank\) at base 4 (a \(4\times 4\) matrix), \(\log Rank\) at base 2 (an \(8\times 8\) matrix), and no \(\log Rank\) transformation (a \(51\times 51\) matrix). For the mass spectrometry data shown in Fig. 10a, the probability distribution \(P_{4\times 4}\) is given in (195). Note that the per-query complexity of LSH-hamming for these \(4\times 4\), \(8\times 8\) and \(51\times 51\) matrices is 0.901, 0.890 and 0.905, respectively. Similarly, the per-query complexity of MinHash for these matrices is 0.4425, 0.376 and 0.386, respectively.
For the mass spectrometry data shown in Fig. 10a, b, the probability distribution p(x, y) is represented as
From (56), for \(P_{4\times 4}\), \((\mu ^*,\nu ^*,\eta ^*,\lambda ^*)\) are derived as
Similarly, for \(P_{8\times 8}\), we have
For the mass spectrometry data shown in Fig. 10c, \((\mu ^*,\nu ^*,\eta ^*,\lambda ^*)\) are
Davoodi, A.G., Chang, S., Yoo, H.G. et al. ForestDSH: a universal hash design for discrete probability distributions. Data Min Knowl Disc 35, 748–795 (2021). https://doi.org/10.1007/s10618-020-00732-6