
ForestDSH: a universal hash design for discrete probability distributions

Published in Data Mining and Knowledge Discovery.

Abstract

In this paper, we consider the problem of classification of high dimensional queries to high dimensional classes from discrete alphabets where the probabilistic model that relates data to the classes is known. This problem has applications in various fields including the database search problem in mass spectrometry. The problem is analogous to the nearest neighbor search problem, where the goal is to find the data point in a database that is the most similar to a query point. The state of the art method for solving an approximate version of the nearest neighbor search problem in high dimensions is locality sensitive hashing (LSH). LSH is based on designing hash functions that map near points to the same buckets with a probability higher than random (far) points. To solve our high dimensional classification problem, we introduce distribution sensitive hashes that map jointly generated pairs to the same bucket with probability higher than random pairs. We design distribution sensitive hashes using a forest of decision trees and we analytically derive the complexity of search. We further show that the proposed hashes perform faster than state of the art approximate nearest neighbor search methods for a range of probability distributions, in both theory and simulations. Finally, we apply our method to the spectral library search problem in mass spectrometry, and show that it is an order of magnitude faster than the state of the art methods.



Notes

  1. Here, we assume that \({\mathbb {P}}(x)\) is also factorizable to i.i.d. components.

  2. The curse of dimensionality holds for non-deterministic probability distributions. When \(p(y\mid x)\) is deterministic, i.e., when it takes only a single value with probability one, there is no curse of dimensionality. In this paper, we are interested in the case of non-deterministic probability distributions.

  3. Note that \(d(x,y)\le R\) is equivalent to x and y differing in at most R coordinates.

  4. In fact, to control the complexity, two more conditions are defined in Definition 2. \({{\mathbb {P}}(x,y)}\) is the joint probability distribution, while \({{\mathbb {P}}^{{\mathcal {A}}}(x)}\) and \({{\mathbb {P}}^{{\mathcal {B}}}(y)}\) are its marginal probability distributions. \({{\mathbb {Q}}(x,y)}\) is defined as \({{\mathbb {Q}}(x,y)}={{\mathbb {P}}^{{\mathcal {A}}}(x)}{{\mathbb {P}}^{{\mathcal {B}}}(y)}\).

  5. If for any node the constraints for accepting as a bucket and pruning hold simultaneously, the algorithm accepts the node as a bucket.

  6. In MinHash and LSH-Hamming, we start with \(\#bands\times \#rows\) randomly selected hashes from the family of distribution sensitive hashes, where \(\#rows\) is the number of rows and \(\#bands\) is the number of bands. We recall a pair (x, y) if x and y are hashed to the same value in all \(\#rows\) rows of at least one of the \(\#bands\) bands.

  7. The \(\log _b Rank\) of a peak is defined as the \(\log _b\) of its rank. For any natural number n, \(\log _b Rank=n\) for the peaks at ranks \(\{b^{n-1},\ldots ,b^{n}-1\}\), e.g., \(\log _2 Rank(m)=3\) for \( m \in \{4,5,6,7\}\). The joint probability distribution of logRanks for the data from Frank et al. (2011) is shown in Fig. 10a (a \(4\times 4\) data matrix obtained using \(\log _4 Rank\)), b (an \(8\times 8\) data matrix obtained using \(\log _2 Rank\)) and c (a \(51\times 51\) data matrix obtained without any \(\log _b Rank\) quantization).

  8. For any natural number n, \({\mathbb {R}}^{n+}\) denotes the set of all n-tuples of non-negative real numbers.

  9. Note that \(V_b(G')\) satisfies the bucket-list property, i.e., no bucket in the tree is an ancestor of another bucket.

  10. Recall that, for simplicity we use the notation \(\sum _{i,j}\) and \(\prod _{i,j}\) instead of \(\sum _{1\le i\le k,1\le j\le l}\) and \(\prod _{1\le i\le k,1\le j\le l}\), respectively.

  11. Note that in the cases where \(q_{ij}\) is zero, the definition of \(q_{ij}\) implies that \(p_{ij}\) is also zero. Therefore, we ignore those branches during the tree construction.

  12. Note that \(f_1(n)=\frac{1}{n}\), \(f_2(r)=r\log r\) and \(f_3(r)=ar\) are convex functions. Therefore, as the sum of convex functions is convex, the optimization problem (103)–(107) is in the form of the convex optimization problem

    $$\begin{aligned} \underset{x}{\text{ Minimize }}~~f(x),&\end{aligned}$$
    (137)
    $$\begin{aligned} {\text{ subject } \text{ to }}~~\{g_i(x)&\le 0, i\in \{1,\ldots ,m\},\nonumber \\ h_j(x)&=0, j\in \{1,\ldots ,p\}\}, \end{aligned}$$
    (138)

    where \(x\in {\mathbb {R}}^n\), f(x) and \(g_i(x)\) are convex functions, and the \(h_j(x)\) are affine functions.
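
To make the generic form above concrete, here is a minimal numerical sketch (not taken from the paper) of a convex program in the form (137)–(138), solved with SciPy; the particular objective and constraints are illustrative assumptions.

```python
from scipy.optimize import minimize

# Toy convex program of the form (137)-(138): minimize a convex f(x)
# subject to a convex inequality g(x) <= 0 and an affine equality h(x) = 0.
# f, g and h below are illustrative assumptions, not taken from the paper.
f = lambda x: x[0] ** 2 + x[1] ** 2        # convex objective
g = lambda x: 1.0 - x[0]                   # g(x) <= 0  <=>  x[0] >= 1
h = lambda x: x[0] + x[1] - 3.0            # affine equality constraint

res = minimize(f, x0=[2.0, 2.0], method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': lambda x: -g(x)},  # SciPy expects fun(x) >= 0
                            {'type': 'eq', 'fun': h}])
print(res.x)  # approximately [1.5, 1.5]
```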

References

  • Aebersold R, Mann M (2003) Mass spectrometry-based proteomics. Nature 422(6928):198–207


  • Anagnostopoulos E, Emiris IZ, Psarros I (2015) Low-quality dimension reduction and high-dimensional approximate nearest neighbor. In: 31st international symposium on computational geometry (SoCG 2015), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik

  • Andoni A, Indyk P, Laarhoven T, Razenshteyn I, Schmidt L (2015) Practical and optimal LSH for angular distance. In: Advances in neural information processing systems, pp 1225–1233

  • Andoni A, Laarhoven T, Razenshteyn I, Waingarten E (2017) Optimal hashing-based time-space trade-offs for approximate near neighbors. In: Proceedings of the twenty-eighth annual ACM-SIAM symposium on discrete algorithms. SIAM, pp 47–66

  • Andoni A, Naor A, Nikolov A, Razenshteyn I, Waingarten E (2018) Data-dependent hashing via nonlinear spectral gaps. In: Proceedings of the 50th annual ACM SIGACT symposium on theory of computing, pp 787–800

  • Andoni A, Razenshteyn I (2015) Optimal data-dependent hashing for approximate near neighbors. In: Proceedings of the forty-seventh annual ACM symposium on theory of computing, pp 793–801

  • Bawa M, Condie T, Ganesan P (2005) LSH forest: self-tuning indexes for similarity search. In: Proceedings of the 14th international conference on world wide web, pp 651–660

  • Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18(9):509–517


  • Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: International conference on database theory. Springer, pp 217–235

  • Bhatia K, Jain H, Kar P, Varma M, Jain P (2015) Sparse local embeddings for extreme multi-label classification. In: Advances in neural information processing systems, pp 730–738

  • Castelli V, Li CS, Thomasian A (2000) Searching multidimensional indexes using associated clustering and dimension reduction information. U.S. Patent No. 6,134,541

  • Chakrabarti A, Regev O (2010) An optimal randomized cell probe lower bound for approximate nearest neighbor searching. Soc Ind Appl Math SIAM J Comput 39(5):1919–1940


  • Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thirty-fourth annual ACM symposium on theory of computing, pp 380–388

  • Choromanska AE, Langford J (2015) Logarithmic time online multiclass prediction. In: Advances in neural information processing systems, pp 55–63

  • Christiani T, Pagh R (2017) Set similarity search beyond MinHash. In: Proceedings of the 49th annual ACM SIGACT symposium on theory of computing, pp 1094–1107

  • Dasarathy BV, Sheela BV (1977) Visiting nearest neighbors—a survey of nearest neighbor pattern classification techniques. In: Proceedings of the international conference on cybernetics and society, pp 630–636

  • Dubiner M (2010) Bucketing coding and information theory for the statistical high-dimensional nearest-neighbor problem. IEEE Trans Inf Theory 56(8):4166–4179


  • Dubiner M (2012) A heterogeneous high-dimensional approximate nearest neighbor algorithm. IEEE Trans Inf Theory 58(10):6646–6658


  • Duda RO, Hart PE, Stork DG (1973) Pattern classification and scene analysis, vol 3. Wiley, New York


  • Frank AM, Monroe ME, Shah AR, Carver JJ, Bandeira N, Moore RJ, Anderson GA, Smith RD, Pevzner PA (2011) Spectral archives: extending spectral libraries to analyze both identified and unidentified spectra. Nat Methods 8(7):587–591


  • Friedman JH, Bentley JL, Finkel RA (1977) An algorithm for finding best matches in logarithmic expected time. ACM Trans Math Softw TOMS 3(3):209–226


  • Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. In: International conference on very large data bases, VLDB, vol 99, pp 518–529

  • Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In: Proceedings of the 1984 ACM SIGMOD international conference on management of data, pp 47–57

  • Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the thirtieth annual ACM symposium on theory of computing, pp 604–613

  • Jain H, Prabhu Y, Varma M (2016) Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp 935–944

  • Kim S, Pevzner PA (2014) MS-GF+ makes progress towards a universal database search tool for proteomics. Nat Commun 5:5277


  • Krizhevsky A (2009) Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto

  • Liu W, Tsang IW (2017) Making decision trees feasible in ultrahigh feature and label dimensions. J Mach Learn Res 18(1):2814–2849


  • McDonald D, Hyde E, Debelius JW, Morton JT, Gonzalez A, Ackermann G, Aksenov AA, Behsaz B, Brennan C, Chen Y, Goldasich LD (2018) American Gut: an open platform for citizen science microbiome research. Msystems 3(3):e00031-18


  • Miltersen PB (1999) Cell probe complexity-a survey. In: Proceedings of the 19th conference on the foundations of software technology and theoretical computer science, advances in data structures workshop, p 2

  • Min R (2005) A non-linear dimensionality reduction method for improving nearest neighbour classification. University of Toronto, Toronto


  • Mori G, Belongie S, Malik J (2001) Shape contexts enable efficient retrieval of similar shapes. In: Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition. CVPR 2001, IEEE, vol 1, pp I

  • Nam J, Mencía EL, Kim HJ, Fürnkranz J (2017) Maximizing subset accuracy with recurrent neural networks in multi-label classification. In: Advances in neural information processing systems, pp 5413–5423

  • Niculescu-Mizil A, Abbasnejad E (2017) Label filters for large scale multilabel classification. In: Artificial intelligence and statistics, pp 1448–1457

  • Prabhu Y, Varma M (2014) Fastxml: a fast, accurate and stable tree-classifier for extreme multi-label learning. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, pp 263–272

  • Rai P, Hu C, Henao R, Carin L (2015) Large-scale Bayesian multi-label learning via topic-based label embeddings. In: Advances in neural information processing systems, pp 3222–3230

  • Rubinstein A (2018) Hardness of approximate nearest neighbor search. In: Proceedings of the 50th annual ACM SIGACT symposium on theory of computing, pp 1260–1268

  • Shakhnarovich G, Viola P, Darrell T (2003) Fast pose estimation with parameter-sensitive hashing. In: Proceedings of the ninth IEEE international conference on computer vision. IEEE, vol 2, p 750

  • Shaw B, Jebara T (2009) Structure preserving embedding. In: Proceedings of the 26th annual international conference on machine learning, pp 937–944

  • Shrivastava A, Li P (2014) Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In: Advances in neural information processing systems, pp 2321–2329

  • Tagami Y (2017) Annexml: approximate nearest neighbor search for extreme multi-label classification. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp 455–464

  • Yen IEH, Huang X, Ravikumar P, Zhong K, Dhillon I (2016) Pd-sparse: a primal and dual sparse approach to extreme multiclass and multilabel classification. In: International conference on machine learning, pp 3069–3077

  • Yianilos PN (1993) Data structures and algorithms for nearest neighbor search in general metric spaces. In: Symposium on discrete algorithms, SODA, vol 93, pp 311–321

  • Zhou WJ, Yu Y, Zhang ML (2017) Binary linear compression for multi-label classification. In: Proceedings of the twenty-sixth international joint conference on artificial intelligence, IJCAI, pp 3546–3552


Author information

Corresponding author

Correspondence to Arash Gholami Davoodi.

Additional information

Responsible editor: Ira Assent, Carlotta Domeniconi, Aristides Gionis, Eyke Hüllermeier.


The work of Arash Gholami Davoodi was supported by a Lane fellowship and a research fellowship from the Alfred P. Sloan Foundation. The work of Hosein Mohimani was supported by a research fellowship from the Alfred P. Sloan Foundation and a National Institutes of Health New Innovator Award DP2GM137413.

Appendices

Appendix 1: Proof of Lemmas 1 and 2

1.1 Appendix 1.1: Proof of Lemma 1

Lemma 1 is proved by induction on the depth of the node v. Consider a node v and its children \(w_{ij}=f(v,a_i,b_j)\). From (21), we have \(\varPhi (w_{ij})=\varPhi (v)p_{ij}\). Therefore, (25) is proved by induction as follows. Assume (25) holds for any node with depth less than d, and consider the node \(w_{ij}\) which is a child of v, i.e., \(w_{ij}=f(v,a_i,b_j)\), with \(depth(w_{ij})=d\).

$$\begin{aligned} \varPhi (w_{ij})= & {} \varPhi (v)p_{ij} \end{aligned}$$
(73)
$$\begin{aligned}= & {} Prob(v\in H_z^{{\mathcal {A}}}(x)\cap H_z^{{\mathcal {B}}}(y)\mid (x,y)\sim {\mathbb {P}})p_{ij} \end{aligned}$$
(74)
$$\begin{aligned}= & {} Prob\big (Seq^{{\mathcal {A}}}(v) ~\text{ is } \text{ a } \text{ prefix } \text{ of } \text{ perm}_z(x),\nonumber \\&Seq^{{\mathcal {B}}}(v) ~\text{ is } \text{ a } \text{ prefix } \text{ of } \text{ perm}_z(y)\mid (x,y)\sim {\mathbb {P}}\big )p_{ij} \end{aligned}$$
(75)
$$\begin{aligned}= & {} Prob\big (Seq^{{\mathcal {A}}}(v) ~\text{ is } \text{ a } \text{ prefix } \text{ of } \text{ perm}_z(x),\nonumber \\&Seq^{{\mathcal {B}}}(v) ~\text{ is } \text{ a } \text{ prefix } \text{ of } \text{ perm}_z(y),a_i={({perm_z}(x))}_d,b_j \nonumber \\= & {} {{({perm_z}(y))}_d}\mid (x,y)\sim {\mathbb {P}}\big )\nonumber \\ \end{aligned}$$
(76)
$$\begin{aligned}= & {} Prob(w_{ij}\in H_z^{{\mathcal {A}}}(x)\cap H_z^{{\mathcal {B}}}(y)\mid (x,y)\sim {\mathbb {P}}). \end{aligned}$$
(77)

See Sect. 4 for the definition of \(perm_z\). Note that \({({perm_z}(x))}_d\) stands for the d-th entry of the vector \({{perm_z}(x)}\). (74) follows from the induction assumption for nodes with depth less than d, and (76) is a result of the i.i.d. assumption (2), i.e., \({\mathbb {P}}(y\mid x)=\prod _{s=1}^S p(y_s\mid x_s)\), and the definition of \(H_z^{{\mathcal {A}}}(x)\) in (19). [(26)–(28)] follow similarly.

1.2 Appendix 1.2: Proof of Lemma 2

Using [(25)–(28)], constraints [(11)–(14)] hold for \(\alpha =\alpha (G)\), \(\beta =\beta (G)\), \(\gamma ^{{\mathcal {A}}}=\gamma ^{{\mathcal {A}}}(G)\) and \(\gamma ^{{\mathcal {B}}}=\gamma ^{{\mathcal {B}}}(G)\). Since no two buckets are ancestor/descendant of each other, we have

$$\begin{aligned} {\mid H^{{\mathcal {A}}}(x)\cap H^{{\mathcal {B}}}(y)\mid } \le 1. \end{aligned}$$
(78)

Therefore, (15) holds. This completes the proof that \(H_z^{{\mathcal {A}}}(x)\) and \(H_z^{{\mathcal {B}}}(y)\) defined in (19) and (20) are \((\alpha (G),\beta (G),\gamma ^{{\mathcal {A}}}(G),\gamma ^{{\mathcal {B}}}(G))\)-sensitive.

Appendix 2: Deriving \(\mu ^*\), \(\nu ^*\), \(\eta ^*\), \(\lambda ^*\), \(p_0\), \(q_0\) and \(\delta \) for \({\mathbb {P}}\), M and N

In this section, an algorithm is presented for deriving \(\mu ^*\), \(\nu ^*\), \(\eta ^*\), \(\lambda ^*\), \(p_0\), \(q_0\) and \(\delta \) for a given probability distribution \({\mathbb {P}}\), M and N.

[Algorithm 5: pseudo-code figure]

Remark 8

In Algorithm 5, the parameters \(\mu ^*\), \(\nu ^*\), \(\eta ^*\) and \(\lambda ^*\) could also be derived using Newton's method.
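
As a complement to Algorithm 5, the following is a hedged numerical sketch (not the paper's implementation) that obtains \((\mu ^*,\nu ^*,\eta ^*)\) and \(\lambda ^*\) by directly maximizing \(\frac{\max (1,\delta )+\mu +\delta \nu }{1+\mu +\nu -\eta }\) subject to \(\sum _{i,j}p_{ij} ^ {1+\mu +\nu -\eta } {(p_{i}^{{\mathcal {A}}})} ^ {-\mu } {(p_j^{{\mathcal {B}}})} ^ {-\nu } = 1\) and \(\min (\mu ,\nu )\ge \eta \ge 0\) with a general-purpose solver. The \(2\times 2\) distribution, \(\delta \) and the starting point are illustrative assumptions, and a local solver may need a reasonable initialization.

```python
import numpy as np
from scipy.optimize import minimize

# Hedged sketch: derive (mu*, nu*, eta*, lambda*) for a given 2x2 distribution
# by constrained optimization. P and delta are assumptions for illustration.
P = np.array([[0.4, 0.1], [0.1, 0.4]])
pA, pB = P.sum(axis=1), P.sum(axis=0)          # marginals p_i^A and p_j^B
delta = 1.0

def constraint(v):
    mu, nu, eta = v
    # sum_{i,j} p_ij^{1+mu+nu-eta} (p_i^A)^{-mu} (p_j^B)^{-nu} - 1
    return np.sum(P ** (1 + mu + nu - eta)
                  * pA[:, None] ** -mu * pB[None, :] ** -nu) - 1.0

def neg_lambda(v):
    mu, nu, eta = v
    return -(max(1.0, delta) + mu + delta * nu) / (1 + mu + nu - eta)

cons = [{'type': 'eq', 'fun': constraint},
        {'type': 'ineq', 'fun': lambda v: v[0] - v[2]},   # mu >= eta
        {'type': 'ineq', 'fun': lambda v: v[1] - v[2]},   # nu >= eta
        {'type': 'ineq', 'fun': lambda v: v[2]}]          # eta >= 0
res = minimize(neg_lambda, x0=[1.0, 1.0, 0.5], method='SLSQP', constraints=cons)
mu_s, nu_s, eta_s = res.x
print(mu_s, nu_s, eta_s, -res.fun)                        # (mu*, nu*, eta*, lambda*)
```

For a matrix with a zero entry, such as \(P_1\) in Appendix 8, the corresponding zero-probability branch can be dropped beforehand (see footnote 11), after which the same procedure should recover values close to those reported in Appendix 8.3.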

Appendix 3: Proof of Theorem 2

In order to prove Theorem 2, we first state the following two lemmas.

Lemma 3

The function \(f(\theta ,\theta _1,\theta _2,\theta _3)={\theta } ^ {1+\rho _1+\rho _2+\rho _3} {\theta _1} ^{-\rho _1} {\theta _2}^ {-\rho _2} {\theta _3} ^{-\rho _3}\) is a convex function on the region \((\theta ,\theta _1,\theta _2,\theta _3)\in {\mathbb {R}}^{4+}\) where \((\rho _1,\rho _2,\rho _3)\in {\mathbb {R}}^{3+}\).Footnote 8

Lemma 4

\(\sum _{v \in V_{Buckets}(G)}{\big (\varPhi (v)\big )}^{1+\mu +\nu -\eta }{\big ({\varPsi }^{{\mathcal {A}}}(v)\big )}^ {-\mu +\eta }{\big ({\varPsi }^{{\mathcal {B}}}(v)\big )}^ {-\nu +\eta }{\big (\varPsi (v)\big )}^ {-\eta } \le 1\) for any \((\mu ,\nu ,\eta )\in {\mathcal {I}}\).

The proofs of Lemma 3 and Lemma 4 are relegated to “Appendices 3.1 and 3.2”, respectively. Consider \((\mu ^ *,\nu ^ *,\eta ^*)\) satisfying (53). For any decision tree satisfying (34)–(38), we have:

$$\begin{aligned}&{\left( \frac{\sum _{v \in V_{Buckets}(G)} {\varPhi (v)}}{|V_{Buckets}(G)|}\right) }^ {1+\mu ^*+\nu ^*-\eta ^*} {\left( \frac{\sum _{v \in V_{Buckets}(G)} {{\varPsi }^{{\mathcal {A}}}(v)}}{|V_{Buckets}(G)|}\right) }^{-\mu ^ *+\eta ^*}\nonumber \\&\times {\left( \frac{\sum _{v \in V_{Buckets}(G)} {{\varPsi }^{{\mathcal {B}}}(v)}}{|V_{Buckets}(G)|}\right) }^{-\nu ^ *+\eta ^*} {\left( \frac{\sum _{v \in V_{Buckets}(G)}{\varPsi (v)}}{|V_{Buckets}(G)|}\right) }^{-\eta ^ *} \nonumber \\&\quad \le \frac{ \sum _{v \in V_{Buckets}(G)}{\big (\varPhi (v)\big )}^{1+\mu ^*+\nu ^*-\eta ^*}{\big ({\varPsi }^{{\mathcal {A}}}(v)\big )}^ {-\mu ^ *+\eta ^*}{\big ({\varPsi }^{{\mathcal {B}}}(v)\big )}^ {-\nu ^ *+\eta ^*}{\big (\varPsi (v)\big )}^ {-\eta ^ *}}{|V_{Buckets}(G)|} \nonumber \\\end{aligned}$$
(79)
$$\begin{aligned}&\quad \le \frac{1}{|V_{Buckets}(G)|}, \end{aligned}$$
(80)

where (79) holds due to the convexity of \(f(\theta ,\theta _1,\theta _2,\theta _3)={\theta } ^ {1+\rho _1+\rho _2+\rho _3} {\theta _1} ^{-\rho _1} {\theta _2}^ {-\rho _2} {\theta _3} ^{-\rho _3}\) established in Lemma 3, applied via Jensen's inequality to the averages over \(V_{Buckets}(G)\), and (80) follows from Lemma 4. Therefore, we have

$$\begin{aligned}&{\left( {\sum _{v \in V_{Buckets}(G)}{\varPhi (v)}}\right) }^ {1+\mu ^*+\nu ^*-\eta ^*} {\left( {\sum _{v \in V_{Buckets}(G)}{{\varPsi }^{{\mathcal {A}}}(v)}}\right) }^{-\mu ^ *+\eta ^*} \nonumber \\&\quad \times {\left( {\sum _{v \in V_{Buckets}(G)}{{\varPsi }^{{\mathcal {B}}}(v)}}\right) }^{-\nu ^ *+\eta ^*} {\left( {\sum _{v \in V_{Buckets}(G)}{\varPsi (v)}}\right) }^{-\eta ^ *} \nonumber \\&\qquad \le 1. \end{aligned}$$
(81)

On the other hand, using (81) and the definitions of \(\alpha (G)\), \(\beta (G)\), \(\gamma ^{{\mathcal {A}}}(G)\) and \(\gamma ^{{\mathcal {B}}}(G)\) in [(29)–(32)], we have

$$\begin{aligned} {\big (\alpha (G)\big )}^ {1+\mu ^*+\nu ^*-\eta ^*} {\big (\gamma ^{{\mathcal {A}}}(G)\big )}^{-\mu ^ *+\eta ^*} {\big (\gamma ^{{\mathcal {B}}}(G)\big )}^{-\nu ^ *+\eta ^*} {\big (\beta (G)\big )}^{-\eta ^ *}\le & {} 1. \end{aligned}$$
(82)

Therefore, from (34)–(38) and (82) we have

$$\begin{aligned}&1\nonumber \\&\quad \ge {\big (\alpha (G)\big )}^ {1+\mu ^*+\nu ^*-\eta ^*} {\big (\gamma ^{{\mathcal {A}}}(G)\big )}^{-\mu ^ *+\eta ^*} {\big (\gamma ^{{\mathcal {B}}}(G)\big )}^{-\nu ^ *+\eta ^*} {\big (\beta (G)\big )}^{-\eta ^ *} \nonumber \\&\quad = \alpha (G){\left( \frac{\alpha (G)}{{\gamma ^{{\mathcal {A}}}(G)}}\right) }^{\mu ^*-\eta ^*} {\left( \frac{\alpha (G)}{{\gamma ^{{\mathcal {B}}}(G)}}\right) }^{\nu ^*-\eta ^*} {\left( \frac{\alpha (G)}{\beta (G)}\right) }^{\eta ^*} \end{aligned}$$
(83)
$$\begin{aligned}&\quad \ge \left( N^{\max (1,\delta )-\lambda }\right) {\left( N^{1-\lambda }\right) }^{\mu ^*-\eta ^*} {\left( N^{\delta -\lambda }\right) }^{\nu ^*-\eta ^*} {\left( N^{1+\delta -\lambda }\right) }^{\eta ^*} \end{aligned}$$
(84)
$$\begin{aligned}&\quad =N^{\max (1,\delta )+\mu ^*+\delta \nu ^*-(1+\mu ^*+\nu ^*-\eta ^*)\lambda }. \end{aligned}$$
(85)

Therefore, we conclude Theorem 2 as follows

$$\begin{aligned} \lambda\ge & {} \lambda ^*=\frac{\max (1,\delta )+\mu ^*+\delta \nu ^*}{1+\mu ^*+\nu ^*-\eta ^*}. \end{aligned}$$
(86)

1.1 Appendix 3.1: Proof of Lemma 3

The Hessian matrix for \(f(\theta ,\theta _1,\theta _2,\theta _3)\) is represented as

$$\begin{aligned}&H(\theta ,\theta _1,\theta _2,\theta _3)\nonumber \\&\quad =f(\theta ,\theta _1,\theta _2,\theta _3)\nonumber \\&\qquad \times \begin{bmatrix} \frac{(1+\rho _1+\rho _2+\rho _3)(\rho _1+\rho _2+\rho _3)}{\theta ^2}&{}\frac{-\rho _1(1+\rho _1+\rho _2+\rho _3)}{\theta \theta _1}&{}\frac{-\rho _2(1+\rho _1+\rho _2+\rho _3)}{\theta \theta _2}&{}\frac{-\rho _3(1+\rho _1+\rho _2+\rho _3)}{\theta \theta _3}\\ \frac{-\rho _1(1+\rho _1+\rho _2+\rho _3)}{\theta \theta _1}&{}\frac{\rho _1(\rho _1+1)}{\theta _1^2}&{}\frac{\rho _1\rho _2}{\theta _1\theta _2}&{}\frac{\rho _1\rho _3}{\theta _1\theta _3}\\ \frac{-\rho _2(1+\rho _1+\rho _2+\rho _3)}{\theta \theta _2}&{}\frac{\rho _1\rho _2}{\theta _1\theta _2}&{}\frac{\rho _2(\rho _2+1)}{\theta _2^2}&{}\frac{\rho _2\rho _3}{\theta _2\theta _3}\\ \frac{-\rho _3(1+\rho _1+\rho _2+\rho _3)}{\theta \theta _3}&{}\frac{\rho _1\rho _3}{\theta _1\theta _3}&{}\frac{\rho _2\rho _3}{\theta _2\theta _3}&{}\frac{\rho _3(\rho _3+1)}{\theta _3^2} \end{bmatrix}\nonumber \\&\quad =f(\theta ,\theta _1,\theta _2,\theta _3)\nonumber \\&\qquad \times \begin{bmatrix} W^2+\frac{\rho _1+\rho _2+\rho _3}{\theta ^2}&{}WW_1-\frac{\rho _1}{\theta \theta _1}&{}WW_2-\frac{\rho _2}{\theta \theta _2}&{}WW_3-\frac{\rho _3}{\theta \theta _3}\\ W_1W-\frac{\rho _1}{\theta \theta _1}&{}W_1^2+\frac{\rho _1}{\theta _1^2}&{}W_1W_2&{}W_1W_3\\ W_2W-\frac{\rho _2}{\theta \theta _2}&{}W_2W_1&{}W_2^2+\frac{\rho _2}{\theta _2^2}&{}W_2W_3\\ W_3W-\frac{\rho _3}{\theta \theta _3}&{}W_3W_1&{}W_3W_2&{}W_3^2+\frac{\rho _3}{\theta _3^2} \end{bmatrix}, \end{aligned}$$
(87)

where \(W=\frac{\rho _1+\rho _2+\rho _3}{\theta }\) and \(W_i=\frac{\rho _i}{\theta _i}\) for any \(i\in \{1,2,3\}\). In order to show that the function \(f(\theta ,\theta _1,\theta _2,\theta _3)\) is a convex function, it is necessary and sufficient to prove that \(H(\theta ,\theta _1,\theta _2,\theta _3)\) is positive semidefinite on \({\mathbb {R}}^{4+}\). On the other hand, for positive semidefinite matrices we have

  1. 1.

    For any non-negative scalar a and positive semidefinite matrix M, aM is positive semidefinite.

  2. 2.

    For positive semidefinite matrices \(M_1\) and \(M_2\), \(M_1+M_2\) is positive semidefinite.

As \(f(\theta ,\theta _1,\theta _2,\theta _3)>0\) for any \(\theta ,\theta _1,\theta _2,\theta _3\), it is sufficient to prove that \(\frac{H(\theta ,\theta _1,\theta _2,\theta _3)}{f(\theta ,\theta _1,\theta _2,\theta _3)}\) is positive semidefinite. Define, \(M_1=\begin{bmatrix} W^2&{}WW_1&{}WW_2&{}WW_3\\ W_1W&{}W_1^2&{}W_1W_2&{}W_1W_3\\ W_2W&{}W_2W_1&{}W_2^2&{}W_2W_3\\ W_3W&{}W_3W_1&{}W_3W_2&{}W_3^2 \end{bmatrix}\) and

\(M_2=\begin{bmatrix} \frac{\rho _1+\rho _2+\rho _3}{\theta ^2}&{}-\frac{\rho _1}{\theta \theta _1}&{}-\frac{\rho _2}{\theta \theta _2}&{}-\frac{\rho _3}{\theta \theta _3}\\ -\frac{\rho _1}{\theta \theta _1}&{}\frac{\rho _1}{\theta _1^2}&{}0&{}0\\ -\frac{\rho _2}{\theta \theta _2}&{}0&{}\frac{\rho _2}{\theta _2^2}&{}0\\ -\frac{\rho _3}{\theta \theta _3}&{}0&{}0&{}\frac{\rho _3}{\theta _3^2} \end{bmatrix}\). Therefore, we have \(M_1+M_2=\frac{H(\theta ,\theta _1,\theta _2,\theta _3)}{f(\theta ,\theta _1,\theta _2,\theta _3)}\). In order to prove that \(\frac{H(\theta ,\theta _1,\theta _2,\theta _3)}{f(\theta ,\theta _1,\theta _2,\theta _3)}\) is positive semidefinite, it is sufficient to prove that \(M_1\) and \(M_2\) are positive semidefinite. The matrices \(M_1\) and \(M_2\) are positive semidefinite as for any non-zero vector \(z=\begin{bmatrix}a&b&c&d\end{bmatrix}\), we have \(zM_1z^T\ge 0\) and \(zM_2z^T\ge 0\), i.e.,

$$\begin{aligned}&\begin{bmatrix}a&b&c&d\end{bmatrix}\begin{bmatrix} W^2&{}WW_1&{}WW_2&{}WW_3\\ W_1W&{}W_1^2&{}W_1W_2&{}W_1W_3\\ W_2W&{}W_2W_1&{}W_2^2&{}W_2W_3\\ W_3W&{}W_3W_1&{}W_3W_2&{}W_3^2 \end{bmatrix}\begin{bmatrix}a\\ b\\ c\\ d\end{bmatrix}\nonumber \\&\quad =(Wa+W_1b+W_2c+W_3d)^2\end{aligned}$$
(88)
$$\begin{aligned}\ge & {} 0, \end{aligned}$$
(89)
$$\begin{aligned}&\begin{bmatrix}a&b&c&d\end{bmatrix}\begin{bmatrix} \frac{\rho _1+\rho _2+\rho _3}{\theta ^2}&{}-\frac{\rho _1}{\theta \theta _1}&{}-\frac{\rho _2}{\theta \theta _2}&{}-\frac{\rho _3}{\theta \theta _3}\\ -\frac{\rho _1}{\theta \theta _1}&{}\frac{\rho _1}{\theta _1^2}&{}0&{}0\\ -\frac{\rho _2}{\theta \theta _2}&{}0&{}\frac{\rho _2}{\theta _2^2}&{}0\\ -\frac{\rho _3}{\theta \theta _3}&{}0&{}0&{}\frac{\rho _3}{\theta _3^2} \end{bmatrix}\begin{bmatrix}a\\ b\\ c\\ d\end{bmatrix}\nonumber \\&\quad =\rho _1{\big (\frac{a}{\theta }-\frac{b}{\theta _1}\big )}^2+\rho _2{\big (\frac{a}{\theta }-\frac{c}{\theta _2}\big )}^2\nonumber \\&\qquad +\rho _3{\big (\frac{a}{\theta }-\frac{d}{\theta _3}\big )}^2 \end{aligned}$$
(90)
$$\begin{aligned}&\quad \ge 0, \end{aligned}$$
(91)

where (91) is concluded as \(\rho _1,\rho _2,\rho _3\ge 0\).
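
As a quick numerical companion to this argument (illustration only, not part of the paper), one can confirm that \(M_1+M_2\) is positive semidefinite for randomly drawn positive \((\theta ,\theta _1,\theta _2,\theta _3)\) and non-negative \((\rho _1,\rho _2,\rho _3)\):

```python
import numpy as np

# Spot check: the scaled Hessian H/f = M1 + M2 from the proof of Lemma 3 is
# positive semidefinite for random positive thetas and non-negative rhos.
rng = np.random.default_rng(0)
for _ in range(5):
    th = rng.uniform(0.1, 3.0, size=4)    # theta, theta_1, theta_2, theta_3
    rho = rng.uniform(0.0, 3.0, size=3)   # rho_1, rho_2, rho_3
    # M1 = W W^T with W = ((rho_1+rho_2+rho_3)/theta, rho_1/theta_1, rho_2/theta_2, rho_3/theta_3)
    W = np.concatenate(([rho.sum() / th[0]], rho / th[1:]))
    M1 = np.outer(W, W)
    # M2 exactly as written above
    M2 = np.zeros((4, 4))
    M2[0, 0] = rho.sum() / th[0] ** 2
    for i in range(3):
        M2[0, i + 1] = M2[i + 1, 0] = -rho[i] / (th[0] * th[i + 1])
        M2[i + 1, i + 1] = rho[i] / th[i + 1] ** 2
    print(np.linalg.eigvalsh(M1 + M2).min() >= -1e-9)   # True in every trial
```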

1.2 Appendix 3.2: Proof of Lemma 4

First, note that \(\varPsi (v)={\varPsi }^{{\mathcal {A}}}(v){\varPsi }^{{\mathcal {B}}}(v)\). Let us define

$$\begin{aligned} D(v)=\varPhi (v) ^ {1+\mu +\nu -\eta }{\big ({\varPsi }^{{\mathcal {A}}}(v)\big )}^ {-\mu }{\big ({\varPsi }^{{\mathcal {B}}}(v)\big )}^ {-\nu }. \end{aligned}$$
(92)

We show

$$\begin{aligned} \sum _{v\in V_{Buckets}(G)}D(v)\le 1 , \end{aligned}$$
(93)

by induction on the number of nodes in the tree. If the tree has only one node, i.e., the root, then (93) holds as \(\varPhi ({root})=1\), \({\varPsi }^{{\mathcal {A}}}({root})=1\) and \({\varPsi }^{{\mathcal {B}}}({root})=1\) from the definition of \(\varPhi (v)\), \({\varPsi }^{{\mathcal {A}}}(v)\) and \({\varPsi }^{{\mathcal {B}}}(v)\) in [(21)–(23)]. Assume that (93) holds for any decision tree with \(|G|<Z\). Our goal is to prove that (93) holds for a decision tree with \(|G|=Z\). Assume \(v_{11}\) is a node with maximum depth in G and consider the tree \(G'\) constructed by removing \(v_{11}\) and all its siblings \(v_{ij},1\le i\le k,1\le j\le l\), belonging to the same parent v. In other words, for the tree \(G'\) we haveFootnote 9

$$\begin{aligned} V(G')= & {} V(G)/\{v_{11},\ldots ,v_{kl}\}, \end{aligned}$$
(94)
$$\begin{aligned} V_{Buckets}(G')= & {} \left( V_{Buckets}(G)\cup \{v\}\right) /\{v_{11},\ldots ,v_{kl}\}. \end{aligned}$$
(95)

(95) is true as, for the graph \(G'\), the node v is now a leaf node while the nodes \(v_{11},\ldots ,v_{kl}\) are removed. Then, we have

$$\begin{aligned}&\sum _{v\in V_{Buckets}(G)}D(v)\nonumber \\&\quad \le \sum _{v\in V_{Buckets}(G')}D(v)-\varPhi (v) ^ {1+\mu +\nu -\eta }{\big ({\varPsi }^{{\mathcal {A}}}(v)\big )}^ {-\mu }{\big ({\varPsi }^{{\mathcal {B}}}(v)\big )}^ {-\nu }\nonumber \\&\qquad +\sum _{i,j}\varPhi (v_{ij}) ^ {1+\mu +\nu -\eta }{{\varPsi }^{{\mathcal {A}}}(v_{ij})}^ {-\mu }{{\varPsi }^{{\mathcal {B}}}(v_{ij})}^ {-\nu } \end{aligned}$$
(96)
$$\begin{aligned}&\quad =\sum _{v\in V_{Buckets}(G')}D(v)-\varPhi (v) ^ {1+\mu +\nu -\eta }{\big ({\varPsi }^{{\mathcal {A}}}(v)\big )}^ {-\mu }{\big ({\varPsi }^{{\mathcal {B}}}(v)\big )}^ {-\nu }\nonumber \\&\qquad +\sum _{i,j}\left( \varPhi (v) ^ {1+\mu +\nu -\eta }{\big ({\varPsi }^{{\mathcal {A}}}(v)\big )}^ {-\mu }{\big ({\varPsi }^{{\mathcal {B}}}(v)\big )}^ {-\nu }\right. \nonumber \\&\qquad \left. \times {p_{ij}} ^ {1+\mu +\nu -\eta }{(p_{i}^{{\mathcal {A}}})}^ {-\mu }{(p_{j}^{{\mathcal {B}}})}^ {-\nu }\right) \end{aligned}$$
(97)
$$\begin{aligned}&\quad =\sum _{v\in V_{Buckets}(G')}D(v)-\varPhi (v) ^ {1+\mu +\nu -\eta }{\big ({\varPsi }^{{\mathcal {A}}}(v)\big )}^ {-\mu }{\big ({\varPsi }^{{\mathcal {B}}}(v)\big )}^ {-\nu }\nonumber \\&\qquad \times \left( 1-\sum _{i,j}{p_{ij}} ^ {1+\mu +\nu -\eta }{(p_i^{{\mathcal {A}}})}^ {-\mu }{(p_j^{{\mathcal {B}}})}^ {-\nu }\right) \end{aligned}$$
(98)
$$\begin{aligned}&\quad =\sum _{v\in V_{Buckets}(G')}D(v) \end{aligned}$$
(99)
$$\begin{aligned}&\quad \le 1, \end{aligned}$$
(100)

where (96) holds from the definition of the tree \(G'\), (97) follows from the recursive definitions of \(\varPhi (v)\), \({\varPsi }^{{\mathcal {A}}}(v)\) and \({\varPsi }^{{\mathcal {B}}}(v)\) in [(21)–(23)], (100) follows from the induction assumption applied to \(G'\), and (99) holds from the definition of \(\mu \), \(\nu \) and \(\eta \) in (52), i.e.,

$$\begin{aligned} \sum _{i,j}{p_{ij}} ^ {1+\mu +\nu -\eta }{(p_i^{{\mathcal {A}}})}^ {-\mu }{(p_j^{{\mathcal {B}}})}^ {-\nu }=1. \end{aligned}$$
(101)

Therefore, we conclude that

$$\begin{aligned} \sum _{v\in V_{Buckets}(G)}D(v)\le & {} 1. \end{aligned}$$
(102)

Note that, the inequality (96) becomes an equality only in cases where none of the children are pruned.

Appendix 4: Proof of Theorem 3

In order to prove Theorem 3, we first present the following lemma.

Lemma 5

Given a fixed \(N\in {\mathbb {N}}\) and probability distribution \({\mathbb {P}}\), consider the following region \({\mathcal {R}}(\lambda ,r_{ij},n)\):

$$\begin{aligned} {\mathcal {R}}(\lambda ,r_{ij},n)=\left\{ \lambda ,r_{ij},n~\text{ s.t. } ~\lambda \ge 0, \sum _{i,j}{r_{ij}} \right.= & {} 1, r_{ij} \ge 0,r_{ij} \in {\mathbb {R}}^+,n \in {\mathbb {R}}^+, \nonumber \\\end{aligned}$$
(103)
$$\begin{aligned} \sum _{i,j}{r_{ij}\log p_{ij}}-\sum _{i,j}{r_{ij}\log r_{ij}}\ge & {} \frac{(\max (1,\delta )-\lambda ) \log N}{n}, \end{aligned}$$
(104)
$$\begin{aligned} \sum _{i,j}{r_{ij}\log p_{ij}}-\sum _{i,j}{r_{ij}\log p_{i}^{{\mathcal {A}}}}\ge & {} \frac{(1-\lambda ) \log N}{n}, \end{aligned}$$
(105)
$$\begin{aligned} \sum _{i,j}{r_{ij}\log p_{ij}}-\sum _{i,j}{r_{ij}\log p_j^{{\mathcal {B}}}}\ge & {} \frac{(\delta -\lambda ) \log N}{n}, \end{aligned}$$
(106)
$$\begin{aligned} \sum _{i,j}{r_{ij}\log p_{ij}}-\sum _{i,j}{r_{ij}\log q_{ij}}\ge & {} \left. \frac{(1+\delta -\lambda ) \log N}{n} \right\} \end{aligned}$$
(107)

Footnote 10 Then, \((\lambda ^*,r_{ij}^*,n^*)\) defined in Definition 4 is a member of \({\mathcal {R}}(\lambda ,r_{ij},n)\).

The proof of Lemma 5 is relegated to “Appendix 4.4”. Let us prove that the following tree construction steps in Algorithm 4 result in a tree that satisfies (34)–(38).

$$\begin{aligned} \left\{ \begin{array}{cl}\frac{\varPhi (v)}{\varPsi (v)} \ge {N ^ {1+\delta -\lambda ^*}}{p_0q_0}&{}: {accept~bucket},\\ \frac{\varPhi (v)}{{\varPsi }^{{\mathcal {A}}}(v)}\le N^{1-\lambda ^*}{p_0q_0}&{}: prune,\\ \frac{\varPhi (v)}{{\varPsi }^{{\mathcal {B}}}(v)}\le N^{\delta -\lambda ^*}{p_0q_0}&{}: prune,\\ otherwise&{}: {branch~into~the~kl~children.} \end{array}\right. {} \end{aligned}$$
(108)
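
A hedged sketch (not the paper's implementation of Algorithm 4) of this accept/prune/branch rule applied to a single node and its children is given below; the distribution, N, \(\delta \) and \(\lambda ^*\) are illustrative assumptions. In the full construction the branch case is applied recursively until every path ends in an accepted bucket or a pruned node.

```python
import numpy as np

# Hedged sketch of decision rule (108): given the node statistics
# (Phi, Psi^A, Psi^B, Psi), decide whether to accept the node as a bucket,
# prune it, or branch into the k*l children. All numbers below are assumptions.
P = np.array([[0.4, 0.1], [0.1, 0.4]])
pA, pB = P.sum(axis=1), P.sum(axis=0)
Q = np.outer(pA, pB)                                   # q_ij = p_i^A p_j^B
N, delta, lambda_star = 10 ** 9, 1.0, 1.4              # assumed parameters
p0 = P.prod()
q0 = min(Q.prod(), np.prod(pA ** P.shape[1]), np.prod(pB ** P.shape[0]))

def decide(phi, psiA, psiB, psi):
    if phi / psi >= N ** (1 + delta - lambda_star) * p0 * q0:
        return "accept bucket"
    if phi / psiA <= N ** (1 - lambda_star) * p0 * q0:
        return "prune"
    if phi / psiB <= N ** (delta - lambda_star) * p0 * q0:
        return "prune"
    return "branch"

def child(phi, psiA, psiB, psi, i, j):
    # Per-child updates used throughout the analysis: Phi multiplies by p_ij,
    # Psi^A by p_i^A, Psi^B by p_j^B and Psi by q_ij.
    return phi * P[i, j], psiA * pA[i], psiB * pB[j], psi * Q[i, j]

root = (1.0, 1.0, 1.0, 1.0)
print("root:", decide(*root))
for i in range(2):
    for j in range(2):
        print((i, j), decide(*child(*root, i, j)))
```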

Consider \(r^*_{ij} =p_{ij} ^ {1+\mu ^*+\nu ^*-\eta ^*} {(p_{i}^{{\mathcal {A}}})} ^ {-\mu ^*} {(p_j^{{\mathcal {B}}})} ^ {-\nu ^*}\) and \(n^* =\frac{(\max (1,\delta )-\lambda ^*) \log N}{\sum {r_{ij}^*\log \frac{p_{ij}}{r_{ij}^*}} }\). Note that we assume \(p_{ij}\) and \(q_{ij}\) are non-zero.Footnote 11 Consider \(n_{ij} = \lceil n^*r_{ij}^* \rceil \) if \(r^*_{ij} > \frac{1}{2}\) and \(n_{ij} = \lfloor n^*r^*_{ij} \rfloor \) if \(r^*_{ij} \le \frac{1}{2}\). Therefore, we have \(n^*-kl<\sum _{ij}n_{ij}\le n^*\). For any \(v\in V(G)\), define the set \({S}_{ij}(v)\) as follows

$$ \begin{aligned} {S}_{ij}(v)= \{s\mid 1\le s\le depth(v),{Seq^{{\mathcal {A}}}_s}(v)=a_i \& {Seq^{{\mathcal {B}}}_s}(v)=b_j\}, \end{aligned}$$
(109)

where depth(v) is the depth of node v in the tree, \(Seq^{{\mathcal {A}}}_s(v)\) and \(Seq^{{\mathcal {B}}}_s(v)\) stand for the character at position s in the strings \(Seq^{{\mathcal {A}}}(v)\) and \(Seq^{{\mathcal {B}}}\)(v), respectively. Now, consider a node \(v^*\) in the graph that satisfies the following constraints:

$$\begin{aligned} |{S}_{ij}(v)|= & {} n_{ij}, \forall 1\le i\le k,1\le j\le l. \end{aligned}$$
(110)

The number of nodes v that satisfy this constraint is \(\left( {\begin{array}{c}n^*\\ n_{11},\ldots ,n_{kl}\end{array}}\right) \). Moreover, define

$$\begin{aligned} V_{n_{11},\ldots ,n_{kl}}= & {} \{v\in V(G)\mid |{S}_{ij}(v)|=n_{ij}, \forall 1\le i\le k,1\le j\le l\}. \end{aligned}$$
(111)

1.1 Appendix 4.1: Node \(v^*\) or one of its ancestors is designated as a bucket by Algorithm 4

Here, we prove that the node \(v^*\) or one of its ancestors is designated as a bucket by Algorithm 4. In order to show this, we need to prove that:

$$\begin{aligned} \frac{\varPhi (v^*)}{\varPsi (v^*)}\ge & {} e ^ {\sum n^*r^*_{ij}\log p_{ij}}e ^ {-\sum n^*r^*_{ij}\log q_{ij}}{p_{0}}{q_{0}} \ge N^{1+\delta - \lambda ^*}{p_{0}}{q_{0}}, \end{aligned}$$
(112)
$$\begin{aligned} \frac{\varPhi (v^*)}{{\varPsi }^{{\mathcal {A}}}(v^*)}\ge & {} e ^ {\sum n^*r^*_{ij}\log p_{ij}}e ^ {-\sum n^*r^*_{ij}\log p_i^{{\mathcal {A}}}}{p_{0}}{q_{0}} \ge N^{1 - \lambda ^*}{p_{0}}{q_{0}}, \end{aligned}$$
(113)
$$\begin{aligned} \frac{\varPhi (v^*)}{{\varPsi }^{{\mathcal {B}}}(v^*)}\ge & {} e ^ {\sum n^*r^*_{ij}\log p_{ij}}e ^ {-\sum n^*r^*_{ij}\log p_j^{{\mathcal {B}}}}{p_{0}}{q_{0}} \ge N^{\delta - \lambda ^*}{p_{0}}{q_{0}}, \end{aligned}$$
(114)

where \(p_{0}\) and \(q_{0}\) are defined as \(\prod _{i,j}p_{ij}\) and \(\min (\prod _{i,j}q_{ij},\prod _{i}{(p_i^{{\mathcal {A}}})}^{l},\prod _j{(p_j^{{\mathcal {B}}})}^{k})\), respectively. Note that \(\varPhi (v^*)\), \(\varPsi (v^*)\), \({\varPsi }^{{\mathcal {A}}}(v^*)\) and \({\varPsi }^{{\mathcal {B}}}(v^*)\) are computed as follows

$$\begin{aligned} \varPhi (v^*)= & {} \prod _{i,j} p_{ij}^{n_{ij}} = e ^ {\sum _{i,j} n_{ij}\log p_{ij}}\ge e ^ {\sum _{i,j} (n^*r_{ij}^*+1)\log p_{ij}} \nonumber \\= & {} e ^ {\sum _{i,j} n^*r_{ij}^*\log p_{ij}}(\prod _{i,j}p_{ij}) =e ^ {\sum _{i,j} n^*r^*_{ij}\log p_{ij}}p_{0}, \end{aligned}$$
(115)
$$\begin{aligned} \varPsi (v^*)= & {} \prod _{i,j} q_{ij}^{n_{ij}} = e ^ {\sum _{i,j} n_{ij}\log q_{ij}} \nonumber \\\le & {} e ^ {\sum _{i,j} (n^*r_{ij}^*-1)\log q_{ij}} = \frac{e ^ {\sum _{i,j} n^*r_{ij}^*\log q_{ij}}}{\prod _{i,j}q_{ij}} \le \frac{e ^ {\sum _{i,j} n^*r_{ij}^*\log q_{ij}}}{q_{0}}, \end{aligned}$$
(116)
$$\begin{aligned} {\varPsi }^{{\mathcal {A}}}(v^*)= & {} \prod _{i,j} {(p_i^{{\mathcal {A}}})}^{n_{ij}} = e ^ {\sum _{i,j} n_{ij}\log p_i^{{\mathcal {A}}}} \nonumber \\\le & {} e ^ {\sum _{i,j} (n^*r_{ij}^*-1)\log {(p_i^{{\mathcal {A}}})}} = \frac{e ^ {\sum _{i,j} n^*r_{ij}^*\log p_i^{{\mathcal {A}}}}}{\prod _{i,j}p_i^{{\mathcal {A}}}} \le \frac{e ^ {\sum _{i,j} n^*r_{ij}^*\log p_i^{{\mathcal {A}}}}}{q_{0}}, \end{aligned}$$
(117)
$$\begin{aligned} {\varPsi }^{{\mathcal {B}}}(v^*)= & {} \prod _{i,j} {(p_j^{{\mathcal {B}}})}^{n_{ij}} = e ^ {\sum _{i,j} n_{ij}\log p_j^{{\mathcal {B}}}} \nonumber \\\le & {} e ^ {\sum _{i,j} (n^*r_{ij}^*-1)\log {p_j^{{\mathcal {B}}}}} = \frac{e ^ {\sum _{i,j} n^*r_{ij}^*\log p_j^{{\mathcal {B}}}}}{\prod _{i,j}{p_j^{{\mathcal {B}}}}} \le \frac{e ^ {\sum _{i,j} n^*r^*_{ij}\log p_j^{{\mathcal {B}}}}}{q_{0}}. \end{aligned}$$
(118)

Therefore, from Lemma 5 and [(115)–(118)] we conclude [(112)–(114)]. This means \(v^*\) or one of its ancestors is accepted as a bucket.

1.2 Appendix 4.2: Proof of bounds [(35)–(38)]

First, we derive a lower bound on \(\alpha (G)\) as follows.

$$\begin{aligned} \alpha (G)= & {} \sum _{v\in V_{Buckets}(G)}\varPhi (v)\nonumber \\\ge & {} \sum _{v\in V_{n_{11},\ldots ,n_{kl}}}\varPhi (v) \end{aligned}$$
(119)
$$\begin{aligned}\ge & {} |V_{n_{11},\ldots ,n_{kl}}|\varPhi (v) \end{aligned}$$
(120)
$$\begin{aligned}\ge & {} |V_{n_{11},\ldots ,n_{kl}}|e ^ {\sum _{i,j} n^*r^*_{ij}\log p_{ij}}p_{0}, \end{aligned}$$
(121)

where \(V_{n_{11},\ldots ,n_{kl}}\) is the set of nodes that satisfy (110). \(|V_{n_{11},\ldots ,n_{kl}}|\) is lower bounded as

$$\begin{aligned} |V_{n_{11},\ldots ,n_{kl}}|= & {} \left( {\begin{array}{c}n^*\\ n_{11},\ldots ,n_{kl}\end{array}}\right) \nonumber \\\ge & {} \frac{n^*!}{(kl)!\prod _{i,j}{n_{ij}!}} \ge \frac{(\frac{n^*}{e})^n\sqrt{2\pi n^*}}{(kl)!\prod _{i,j} (\frac{n_{ij}}{e})^{n_{ij}}\sqrt{2\pi n_{ij}} e} \end{aligned}$$
(122)
$$\begin{aligned}\ge & {} \prod _{i,j}{(\frac{n_{ij}}{n^*}) ^ {-n_{ij}}} {(n^*)} ^ {\frac{1 - kl}{2}} \frac{ (2\pi ) ^ {\frac{1 - kl}{2}}e ^ {-kl}}{(kl)!} \end{aligned}$$
(123)
$$\begin{aligned}= & {} c e ^ {-\sum _{i,j} n_{ij}\log (\frac{n_{ij}}{n^*})} {(n^*)} ^ {\frac{1 - kl}{2}} \end{aligned}$$
(124)
$$\begin{aligned}\ge & {} c e ^ {-n^*\sum _{i,j} r^*_{ij}\log r^*_{ij}}, \end{aligned}$$
(125)

for some constant \(c=\frac{ (2\pi ) ^ {\frac{1 - kl}{2}}e ^ {-kl}}{(kl)!} {(n^*)} ^ {\frac{1 - kl}{2}}\) depending on \(n^*\), k and l. (122) is true as for any natural number m we have \(\sqrt{2\pi m}{(\frac{m}{e})}^m\le m!< \sqrt{2\pi m}{(\frac{m}{e})}^me\). (125) follows as \(a\log \frac{1}{a}\) is an increasing function of a for \(0\le a\le 0.5\), and a decreasing function for \(0.5\le a\le 1\). Therefore, from (115) and (121), \({\alpha (G)} \ge N^{\max (1,\delta )-\lambda ^*}\) is concluded. Similarly, (35)–(38) are proved as follows.

$$\begin{aligned} \frac{\alpha (G)}{\beta (G)}= & {} \frac{\sum _{v\in V_{Buckets}(G)}\varPhi (v)}{\sum _{v\in V_{Buckets}(G)}\varPsi (v)} \ge N^{1+\delta - \lambda ^*}p_{0}q_{0}, \end{aligned}$$
(126)
$$\begin{aligned} \frac{\alpha (G)}{\gamma ^{{\mathcal {A}}}(G)}= & {} \frac{\sum _{v\in V_{Buckets}(G)}\varPhi (v)}{\sum _{v\in V_{Buckets}(G)}{\varPsi }^{{\mathcal {A}}}(v)} \ge N^{1 - \lambda ^*}p_{0}q_{0}, \end{aligned}$$
(127)
$$\begin{aligned} \frac{\alpha (G)}{\gamma ^{{\mathcal {B}}}(G)}= & {} \frac{\sum _{v\in V_{Buckets}(G)}\varPhi (v)}{\sum _{v\in V_{Buckets}(G)}{\varPsi }^{{\mathcal {B}}}(v)} \ge N^{\delta - \lambda ^*}p_{0}q_{0} , \end{aligned}$$
(128)

where (126)–(128) are concluded from (112)–(114) and the fact that \(\frac{\sum _i{a_i}}{\sum _ib_i}\ge c\) holds whenever \(\frac{a_i}{b_i}\ge c\) and \(b_i>0\) for all i.
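
As a side check (illustration only), the Stirling-type bounds used in (122) can be verified numerically:

```python
import math

# Check of sqrt(2*pi*m) * (m/e)**m <= m! < sqrt(2*pi*m) * (m/e)**m * e
# for a few values of m.
for m in (1, 5, 10, 20):
    lower = math.sqrt(2 * math.pi * m) * (m / math.e) ** m
    assert lower <= math.factorial(m) < lower * math.e
    print(m, lower, math.factorial(m), lower * math.e)
```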

1.3 Appendix 4.3: Bounding number of nodes in the tree, i.e., (34)

The number of nodes in the decision tree defined in (57) is bounded by \(O(N^{\lambda ^*})\) using the following three lemmas.

Lemma 6

For any leaf node v of the decision tree defined in (57), we have

$$\begin{aligned} {\varPhi (v)}\ge & {} {N ^ {-\lambda ^*}}\min _{i,j} p_{ij}{p_0q_0}. \end{aligned}$$
(129)

Lemma 7

For any tree G, the summation of \(\varPhi (v)\) over all the leaf nodes is equal to one, i.e., \(\sum _{v\in V_{l}}\varPhi (v)=1\).

Lemma 8

The number of nodes in the decision tree defined in (57) is at most twice the number of leaf nodes.

For the proofs of Lemmas 6, 7 and 8, see “Appendices 4.5, 4.6 and 4.7”. Therefore, we have

$$\begin{aligned}&1\nonumber \\&\quad =\sum _{v \in V_1(G)}{\varPhi (v)} \end{aligned}$$
(130)
$$\begin{aligned}&\quad \ge \sum _{v \in V_l(G)}N^{-\lambda ^*}\min _{i,j} p_{ij}{p_0q_0} \end{aligned}$$
(131)
$$\begin{aligned}&\quad =|V_l(G)|N^{-\lambda ^*}\min _{i,j} p_{ij}{p_0q_0} \end{aligned}$$
(132)
$$\begin{aligned}&\quad \ge \frac{|V(G)|}{2}N^{-\lambda ^*}\min _{i,j} p_{ij}{p_0q_0}, \end{aligned}$$
(133)

where (130) follows from Lemma 7, (131) is true from (167) and (133) is concluded from Lemma 8. Therefore, we conclude that \(|V(G)|=O( N^{\lambda ^*})\).

1.4 Appendix 4.4: Proof of Lemma 5

Consider the optimization problem of finding the member of \((\lambda ,r_{ij},n)\in {\mathcal {R}}(\lambda ,r_{ij},n)\) with minimum \(\lambda \). This optimization problem is a convex optimization problem.Footnote 12 Therefore, writing the KKT conditions, we have

$$\begin{aligned}&F(r_{ij},n,\lambda )\nonumber \\&\quad =\lambda + \sum _{i,j}\mu _{1ij}( -r_{ij}) +\mu _2\big (\frac{(\max (1,\delta )-\lambda ) \log N}{n}-\sum _{i,j}{r_{ij}\log p_{ij}}\nonumber \\&\qquad +\sum _{i,j}{r_{ij}\log r_{ij}}\big )\nonumber \\&\qquad +\mu _3\big (\frac{(1-\lambda ) \log N}{n}-\sum _{i,j}{r_{ij}\log p_{ij}}+\sum _{i,j}{r_{ij}\log p_i^{{\mathcal {A}}}}\big )\nonumber \\&\qquad +\mu _4\big (\frac{(\delta -\lambda ) \log N}{n}-\sum _{i,j}{r_{ij}\log p_{ij}}+\sum _{i,j}{r_{ij}\log p_j^{{\mathcal {B}}}}\big )\nonumber \\&\qquad +\mu _5\big ( \frac{(1+\delta -\lambda ) \log N}{n}-\sum _{i,j}{r_{ij}\log p_{ij}}+\sum _{i,j}{r_{ij}\log q_{ij}} \big )\nonumber \\&\qquad + \mu _6\big (\sum _{i,j}{r_{ij}}-1\big ), \end{aligned}$$
(135)

where

$$\begin{aligned} \mu _2,\mu _3,\mu _4,\mu _5,r_{ij}\ge 0, \mu _{1ij}r_{ij}=0, \sum _{i,j}{r_{ij}}-1= & {} 0, \end{aligned}$$
(136)
$$\begin{aligned} \frac{(\max (1,\delta )-\lambda ) \log N}{n}-\sum _{i,j}{r_{ij}\log p_{ij}}+\sum _{i,j}{r_{ij}\log r_{ij}}\le & {} 0, \end{aligned}$$
(139)
$$\begin{aligned} \mu _2\left( \frac{(\max (1,\delta )-\lambda ) \log N}{n}-\sum _{i,j}{r_{ij}\log p_{ij}}+\sum _{i,j}{r_{ij}\log r_{ij}} \right)= & {} 0, \end{aligned}$$
(140)
$$\begin{aligned} \frac{(1-\lambda ) \log N}{n}-\sum _{i,j}{r_{ij}\log p_{ij}}+\sum _{i,j}{r_{ij}\log p_{i}^{{\mathcal {A}}}}\le & {} 0, \end{aligned}$$
(141)
$$\begin{aligned} \mu _3\left( \frac{(1-\lambda ) \log N}{n}-\sum _{i,j}{r_{ij}\log p_{ij}}+\sum _{i,j}{r_{ij}\log p_{i}^{{\mathcal {A}}}}\right)= & {} 0, \end{aligned}$$
(142)
$$\begin{aligned} \frac{(\delta -\lambda ) \log N}{n}-\sum _{i,j}{r_{ij}\log p_{ij}}+\sum _{i,j}{r_{ij}\log p_j^{{\mathcal {B}}}}\le & {} 0, \end{aligned}$$
(143)
$$\begin{aligned} \mu _4\left( \frac{(\delta -\lambda ) \log N}{n}-\sum _{i,j}{r_{ij}\log p_{ij}}+\sum _{i,j}{r_{ij}\log p_j^{{\mathcal {B}}}}\right)= & {} 0, \end{aligned}$$
(144)
$$\begin{aligned} \frac{(1+\delta -\lambda ) \log N}{n}-\sum _{i,j}{r_{ij}\log p_{ij}}+\sum _{i,j}{r_{ij}\log q_{ij}}\le & {} 0, \end{aligned}$$
(145)
$$\begin{aligned} \mu _5\left( \frac{(1+\delta -\lambda ) \log N}{n}-\sum _{i,j}{r_{ij}\log p_{ij}}+\sum _{i,j}{r_{ij}\log q_{ij}}\right)= & {} 0. \end{aligned}$$
(146)

From (136), \(\mu _{1ij}\) is zero whenever \(r_{ij}\) is non-zero. Therefore, we only keep the indices i and j where \(r_{ij}\ne 0\) and \(\mu _{1ij}=0\).

$$\begin{aligned}&\frac{d F(r_{ij},n,\lambda )}{dr_{ij}} = 0 \rightarrow \mu _2 + \mu _2\log r_{ij}+ \mu _3\log p_{i}^{{\mathcal {A}}}+ \mu _4\log p_j^{{\mathcal {B}}}+ \mu _5\log q_{ij}+ \mu _6\nonumber \\&\qquad -(\mu _2+\mu _3+\mu _4+\mu _5)\log p_{ij}= 0 \end{aligned}$$
(147)
$$\begin{aligned}&\quad \rightarrow r_{ij} ^ {\mu _2} = p_{ij} ^ {\mu _2+\mu _3+\mu _4+\mu _5} {(p_i^{{\mathcal {A}}})} ^ {-\mu _3}{( p_j^{{\mathcal {B}}})} ^ {-\mu _4}q_{ij} ^ {-\mu _5} e ^ {-\mu _2 - \mu _6}. \end{aligned}$$
(148)

Consider the following two cases.

  1. 1.

    \(\mu _2=0\). In this case, all the constraints are affine functions and therefore we have a linear programming problem and the feasible set of this linear programming problem is a polyhedron. From (103), the polyhedron is bounded, i.e., \(0\le r_{ij}\le M\) for some constant M. Assume that the polyhedron is nonempty, otherwise the solution is \(\infty \). Moreover, a nonempty bounded polyhedron cannot contain a line, thus it must have a basic feasible solution and the optimal solutions are restricted to the corner points.

  2. 2.

    \(\mu _2\ne 0\). As \(r_{ij}\ne 0\) and \(\mu _{1ij}=0\), we have

    $$\begin{aligned}&\frac{d F(r_{ij},n,\lambda )}{dr_{ij}} = 0 \rightarrow \mu _2 + \mu _2\log r_{ij}+ \mu _3\log p_{i}^{{\mathcal {A}}}\nonumber \\&\qquad + \mu _4\log p_j^{{\mathcal {B}}}+ \mu _5\log q_{ij}+ \mu _6\nonumber \\&\qquad -(\mu _2+\mu _3+\mu _4+\mu _5)\log p_{ij}= 0 \end{aligned}$$
    (149)
    $$\begin{aligned}&\quad \rightarrow r_{ij} ^ {\mu _2} = p_{ij} ^ {\mu _2+\mu _3+\mu _4+\mu _5} {(p_i^{{\mathcal {A}}})} ^ {-\mu _3}{(p_j^{{\mathcal {B}}})} ^ {-\mu _4}q_{ij} ^ {-\mu _5} e ^ {-\mu _2 - \mu _6} \end{aligned}$$
    (150)
    $$\begin{aligned}&\quad \rightarrow r_{ij} =cp_{ij} ^ {\frac{\mu _2+\mu _3+\mu _4+\mu _5}{\mu _2}} {(p_i^{{\mathcal {A}}})} ^ {-\frac{\mu _3}{\mu _2}} {(p_j^{{\mathcal {B}}})} ^ {-\frac{\mu _4}{\mu _2}} q_{ij} ^ {-\frac{\mu _5}{\mu _2}}, \end{aligned}$$
    (151)
    $$\begin{aligned}&\qquad \frac{d F(r_{ij},n,\lambda )}{dn} = 0 \nonumber \\&\quad \rightarrow -\left( \mu _2(\max (1,\delta )-\lambda ) +\mu _3(1-\lambda ) +\mu _4(\delta -\lambda )\right. \nonumber \\&\qquad \left. +\mu _5(1+\delta -\lambda ) \right) \frac{\log N}{n^2}=0 \end{aligned}$$
    (152)
    $$\begin{aligned}&\quad \rightarrow \mu _2(\max (1,\delta )-\lambda ) +\mu _3(1-\lambda ) \nonumber \\&\qquad +\mu _4(\delta -\lambda )+\mu _5(1+\delta -\lambda )=0 \end{aligned}$$
    (153)
    $$\begin{aligned}&\quad \rightarrow \lambda =\frac{\mu _2\max (1,\delta )+\mu _3+\mu _4\delta +\mu _5(1+\delta )}{\mu _2+\mu _3+\mu _4+\mu _5}. \end{aligned}$$
    (154)

    Summing (140), (142), (144) and (146) and using (151), we have

    $$\begin{aligned}&\left( \mu _2(\max (1,\delta )-\lambda ) +\mu _3(1-\lambda ) \right. \left. +\mu _4(\delta -\lambda )+\mu _5(1+\delta -\lambda ) \right) \frac{\log N}{n}\nonumber \\&\quad =\mu _2\log c. \end{aligned}$$
    (155)

    From (153), (155) and \(\mu _2\ne 0\), we conclude that \(c=1\). Moreover, we have \(\lambda =\frac{\max (1,\delta )+\mu +\nu \delta }{1+\mu +\nu -\eta }\), where \(\mu \), \(\nu \) and \(\eta \) are defined as:

    $$\begin{aligned} \mu= & {} \frac{\mu _3+\mu _5}{\mu _2}, \end{aligned}$$
    (156)
    $$\begin{aligned} \nu= & {} \frac{\mu _4+\mu _5}{\mu _2}, \end{aligned}$$
    (157)
    $$\begin{aligned} \eta= & {} \frac{\mu _2'-\mu _5}{\mu _2}. \end{aligned}$$
    (158)

    Assume that \(\eta >0\), therefore \(r_{ij}<p_{ij}\) as \(p_{ij}\le \min (q_i,q_j)\). On the other hand, \(\sum _{i,j}{r}_{ij}=\sum p_{ij}=1\) which contradicts our assumption that \(\eta >0\). Thus, \(\eta \le 0\). Define \((\mu ^*,\nu ^*,\eta ^*)\) as follows

    $$\begin{aligned} (\mu ^*,\nu ^*,\eta ^*)&=\arg \max _{\min (\mu ,\nu )\ge \eta \ge 0,\sum _{i,j}p_{ij} ^ {1+\mu +\nu -\eta } {(p_{i}^{{\mathcal {A}}})} ^ {-\mu } {(p_j^{{\mathcal {B}}})} ^ {-\nu } = 1}\nonumber \\&\quad \frac{\max (1,\delta )+\mu +\nu \delta }{1+\mu +\nu -\eta }. \end{aligned}$$
    (159)

    Therefore, from (140), (151) and (154) we conclude Lemma 5 as follows

    $$\begin{aligned} r_{ij}^*= & {} p_{ij} ^ {1+\mu ^*+\nu ^*-\eta ^*} {(p_{i}^{{\mathcal {A}}})} ^ {-\mu ^*} {(p_j^{{\mathcal {B}}})} ^ {-\nu ^*}, \end{aligned}$$
    (160)
    $$\begin{aligned} \lambda ^*= & {} \frac{\max (1,\delta )+\mu ^*+\nu ^*\delta }{1+\mu ^*+\nu ^*-\eta ^*}, \end{aligned}$$
    (161)
    $$\begin{aligned} n^*= & {} \frac{(\max (1,\delta )-\lambda ^*) \log N}{\sum r_{ij}^*\log \frac{p_{ij}}{r_{ij}^*}}. \end{aligned}$$
    (162)

1.5 Appendix 4.5: Proof of Lemma 6

In order to prove Lemma 6, consider a leaf node \(v_0\) and its parent \(v_1\). From (57), for the node \(v_1\) we have

$$\begin{aligned} \frac{\varPhi (v_1)}{\varPsi (v_1)}\le & {} {N ^ {1+\delta -\lambda ^*}}{p_0q_0}, \end{aligned}$$
(163)
$$\begin{aligned} \frac{\varPhi (v_1)}{{\varPsi }^{{\mathcal {A}}}(v_1)}\ge & {} N^{1-\lambda ^*}{p_0q_0}, \end{aligned}$$
(164)
$$\begin{aligned} \frac{\varPhi (v_1)}{{\varPsi }^{{\mathcal {B}}}(v_1)}\ge & {} N^{\delta -\lambda ^*}{p_0q_0}. \end{aligned}$$
(165)

(163)–(165) follow from (57) and the fact that the node \(v_1\) is neither pruned nor accepted, as it is not a leaf node. Therefore, from (164)\(\times \)(165)/(163), we conclude that

$$\begin{aligned} {\varPhi (v_1)}\ge & {} {N ^ {-\lambda ^*}}{p_0q_0}. \end{aligned}$$
(166)

Therefore, from definition of \(\varPhi (v)\), for the leaf node \(v_0\) we have

$$\begin{aligned} {\varPhi (v_0)}\ge & {} {N ^ {-\lambda ^*}}\min _{i,j} p_{ij}{p_0q_0}. \end{aligned}$$
(167)

(167) is true for all the leaf nodes as (163)–(167) were derived independently of the choice of the leaf node.

1.6 Appendix 4.6: Proof of Lemma 7

The proof is straightforward based on the proof of Lemma 4. \(\sum _{v\in V_{l}}\varPhi (v)=1\) is proved by induction on the depth of the tree. For a tree G with \(depth(G)=1\), \(\sum _{v\in V_{l}}\varPhi (v)=1\) is trivial as for the children \(v_{ij}\) of the root we have \(\varPhi (v_{ij})=p_{ij}\) and \(\sum _{ij}p_{ij}=1\). Assume that \(\sum _{v\in V_{l}}\varPhi (v)=1\) is true for all the trees G with \(depth(G)\le depth\). Our goal is to prove that \(\sum _{v\in V_{l}}\varPhi (v)=1\) is true for all the trees G with \(depth(G)= depth+1\). Consider a tree G with \(depth(G)= depth+1\) and the tree \(G'\) obtained by removing all the nodes at depth \( depth+1\).

$$\begin{aligned}&\sum _{v\in V_{l}(G)}\varPhi (v)\nonumber \\&\quad =\sum _{v\in V_{l}(G),depth(v)=depth+1}\varPhi (v) \nonumber \\&\qquad -\sum _{w\in V(G), w \text{ is } \text{ a } \text{ parent } \text{ of } \text{ a } \text{ leaf } \text{ node },depth(w)=depth}\varPhi (w)\nonumber \\&\qquad +\sum _{v\in V_{l}(G')}\varPhi (v) \end{aligned}$$
(168)
$$\begin{aligned}&\quad =\sum _{v\in V_{l}(G')}\varPhi (v) \end{aligned}$$
(169)
$$\begin{aligned}&\quad =1. \end{aligned}$$
(170)

(168) is a result of the definition of \(G'\), i.e., \(G'\) is obtained from G by removing all the nodes at depth \( depth+1\). (169) is true as for any node w and its children \(v_{ij}\) we have \(\varPhi (w)=\sum _{ij} \varPhi (v_{ij})\), which is a result of the fact that \(\varPhi (v_{ij})=\varPhi (w)p_{ij}\) and \(\sum _{i,j}p_{ij}=1\). (170) is concluded from the induction assumption, i.e., \(\sum _{v\in V_{l}}\varPhi (v)=1\) is true for all the trees G with \(depth(G)\le depth\).
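
A small numerical illustration of Lemma 7 (not from the paper): since \(\varPhi \) of the children of any node sums to \(\varPhi \) of that node, the leaf sum is invariant under branching. The distribution and the random stopping rule below are arbitrary assumptions.

```python
import random
import numpy as np

P = np.array([[0.4, 0.1], [0.1, 0.4]])   # assumed joint distribution, sums to 1

def leaf_phi_sum(phi=1.0, depth=0, max_depth=6):
    # Randomly either stop (declare a leaf) or branch into all k*l children,
    # each child receiving phi * p_ij.
    if depth == max_depth or random.random() < 0.4:
        return phi
    return sum(leaf_phi_sum(phi * P[i, j], depth + 1, max_depth)
               for i in range(2) for j in range(2))

random.seed(0)
print(leaf_phi_sum())   # 1.0 up to floating-point error, regardless of the tree shape
```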

1.7 Appendix 4.7: Proof of Lemma 8

For any decision tree in which each node is either pruned, accepted or branched into kl children, the number of nodes in the tree is at most twice the number of leaf nodes, i.e., \(|V(G)|\le 2|V_l(G)|\). This is proved by induction on the depth of the tree. For a tree G with \(depth(G)=1\), we have \(|V_l(G)|=kl\) and \(|V(G)|=kl+1\), so Lemma 8 is true in this case. Assume that \(|V(G)|\le 2|V_l(G)|\) is true for all trees G with \(depth(G)\le depth\). Our goal is to prove that \(|V(G)|\le 2|V_l(G)|\) is true for all trees G with \(depth(G)= depth+1\). Consider a tree G with \(depth(G)=depth+1\), and let \(G'\) be the tree obtained by removing all the nodes v with \(depth(v)=depth+1\); assume there are klr of them (each intermediate node has kl children). Therefore, we have \(|V(G)|=|V(G')|+klr\), \(|V_l(G)|=|V_l(G')|+(kl-1)r\) and \(|V(G')|\le 2|V_l(G')|\). This results in \(|V(G)|\le 2|V_l(G)|\).

Appendix 5: Proof of Theorem 4

First, note that from \(\frac{p_{ij}}{1+\epsilon }\le p'_{ij}\le {p_{ij}}{(1+\epsilon )}\), we conclude that \(\frac{p_{i}^{{\mathcal {A}}}}{1+\epsilon }\le {p'}_{i}^{{\mathcal {A}}}\le {{p}_{i}^{{\mathcal {A}}}}{(1+\epsilon )}\), \(\frac{p_j^{{\mathcal {B}}}}{1+\epsilon }\le {p'}_{j}^{{\mathcal {B}}}\le {p_j^{{\mathcal {B}}}}{(1+\epsilon )}\) and \(\frac{q_{ij}}{{{(1+\epsilon )}^2}}\le q'_{ij}\le {q_{ij}}{{{(1+\epsilon )}^2}}\). Assume that \(depth(G)=d\). Define the variables \(\varPhi (v)\), \({\varPsi }^{{\mathcal {A}}}(v)\), \({\varPsi }^{{\mathcal {B}}}(v)\), \(\varPsi (v)\), \(\alpha (G)\), \(\gamma ^{{\mathcal {A}}}(G)\), \(\gamma ^{{\mathcal {B}}}(G)\), \(\beta (G)\) and TP for the tree with the distribution \(p'\) as \(\varPhi '(v)\), \(\varPsi '^{{\mathcal {A}}}(v)\), \(\varPsi '^{{\mathcal {B}}}(v)\), \(\varPsi '(v)\), \(\alpha '(G)\), \(\gamma '^{{\mathcal {A}}}(G)\), \(\gamma '^{{\mathcal {B}}}(G)\), \(\beta '(G)\) and \(TP'\). Therefore, from (29)–(32) we have

$$\begin{aligned} \alpha (G)= & {} \sum _{v\in V_{Buckets}(G)}\varPhi (v) \end{aligned}$$
(171)
$$\begin{aligned}\le & {} \sum _{v\in V_{Buckets}(G)}\varPhi '(v) (1+\epsilon )^d \end{aligned}$$
(172)
$$\begin{aligned}\le & {} \alpha '(G) (1+\epsilon )^d, \end{aligned}$$
(173)

(172) follows from \(\varPhi (v)\le \varPhi '(v)(1+\epsilon )^d\) which is a result of the definition of \(\varPhi (v)\) in (22), i.e., \(\varPhi (f(v,a_i,b_j))=\varPhi (v)p_{ij},\forall v\in V\). Similarly, we have

$$\begin{aligned} \frac{\alpha (G)}{(1+\epsilon )^d}\le&\alpha '(G)&\le \alpha (G) (1+\epsilon )^d, \end{aligned}$$
(174)
$$\begin{aligned} \frac{\gamma ^{{\mathcal {A}}}(G)}{(1+\epsilon )^d}\le&\gamma '^{{\mathcal {A}}}(G)&\le \gamma ^{{\mathcal {A}}}(G) (1+\epsilon )^d, \end{aligned}$$
(175)
$$\begin{aligned} \frac{\gamma ^{{\mathcal {B}}}(G)}{(1+\epsilon )^d}\le&\gamma '^{{\mathcal {B}}}(G)&\le \gamma ^{{\mathcal {B}}}(G) (1+\epsilon )^d, \end{aligned}$$
(176)
$$\begin{aligned} \frac{\beta (G)}{(1+\epsilon )^{2d}}\le&\beta '(G)&\le \beta (G) (1+\epsilon )^{2d}. \end{aligned}$$
(177)

On the other hand, from (48)–(51), we have

$$\begin{aligned} TP= & {} 1-{(1-\alpha '(G))}^{\#bands}\ge 1-{(1-\frac{\alpha (G)}{(1+\epsilon )^d})}^{\#bands}, \end{aligned}$$
(178)

Due to the inequality \((1-x)^{\frac{c}{x}}< e^{-c}\), the minimum possible value of \(\#bands\) to ensure true positive rate TP can be computed as

$$\begin{aligned} \#bands= & {} \lceil \frac{\log \frac{1}{1-TP}}{\frac{\alpha (G)}{(1+\epsilon )^d}}\rceil . \end{aligned}$$
(179)
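
A one-line numerical illustration of (179), with assumed values for TP, \(\alpha (G)\), \(\epsilon \) and d:

```python
import math

# Number of bands needed to reach a target true-positive rate TP when each
# band succeeds with probability alpha / (1 + eps)**d. All values are assumed.
def num_bands(tp, alpha, eps, d):
    return math.ceil(math.log(1.0 / (1.0 - tp)) / (alpha / (1.0 + eps) ** d))

print(num_bands(tp=0.99, alpha=1e-3, eps=0.05, d=20))
```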

Thus, the total complexity is computed as

$$\begin{aligned}&c_{tree}|V(G)|+\left( \frac{c_{hash}N}{\alpha (G)}+\frac{c_{hash}M}{\alpha (G)}+\frac{c_{insertion}N\gamma '^{{\mathcal {A}}}(G)}{\alpha (G)}+\frac{c_{insertion}M\gamma '^{{\mathcal {B}}}(G)}{\alpha (G)}\right. \nonumber \\&\qquad \left. +\frac{c_{pos}MN\beta '(G)}{\alpha (G)}\right) .{{(1+\epsilon )^d}\log \frac{1}{1-TP}} \end{aligned}$$
(180)
$$\begin{aligned}&\quad \le {(1+\epsilon )}^{3d} N^{\lambda ^{*}} \end{aligned}$$
(181)
$$\begin{aligned}&\quad \le N^{\lambda ^{*}+3c_d\log (1+\epsilon )}, \end{aligned}$$
(182)

where (181) follows from (174)–(177) and the fact that the total complexity for distribution \(p'\) is \(O(N^{\lambda ^{*}(p')})\). Finally, (182) is obtained from the following lemma in which we prove that the depth of the tree is bounded by \(c_d\log N\) where \(c_d\) is a constant depending on the distribution.

Lemma 9

For the decision tree G, \(depth(G)\le c_d\log N\) where \(c_d=\min \big (\frac{{(\lambda ^*-1)}}{\log (\max _{i,j}\frac{p_{ij}}{p_i^{{\mathcal {A}}}})},\frac{{(\lambda ^*-\delta )}}{\log (\max _{i,j}\frac{p_{ij}}{p_j^{{\mathcal {B}}}})}\big )\).

For proof of Lemma 9, see “Appendix 5.1”.

1.1 Appendix 5.1: Proof of Lemma 9

From (57), for the decision tree G we have

$$\begin{aligned} \left\{ \begin{array}{cl}\frac{\varPhi (v)}{\varPsi (v)} \ge {N ^ {1+\delta -\lambda ^*}}{p_0q_0}&{}: \text{ Accept } \text{ bucket },\\ \frac{\varPhi (v)}{{\varPsi }^{{\mathcal {A}}}(v)}\le N^{1-\lambda ^*}{p_0q_0}&{}: \text{ Prune },\\ \frac{\varPhi (v)}{{\varPsi }^{{\mathcal {B}}}(v)}\le N^{\delta -\lambda ^*}{p_0q_0}&{}: \text{ Prune },\\ otherwise&{}: \text{ Branch } \text{ into } \text{ the } kl \text{ children. } \end{array}\right. \end{aligned}$$
(183)

Our goal here is to prove that for any pruned or accepted node v, \(depth(v)\le c_d\log N\) where \(c_d\) is a constant depending on the distribution. Consider a pruned or accepted node v. For its parent w, we have

$$\begin{aligned} N^{1-\lambda ^*}{p_0q_0}< & {} \frac{\varPhi (v)}{{\varPsi }^{{\mathcal {A}}}(v)}. \end{aligned}$$
(184)

Therefore, we conclude that

$$\begin{aligned} d\le & {} \frac{{(\lambda ^*-1)}\log N}{\log (\max _{i,j}\frac{p_{ij}}{p_i^{{\mathcal {A}}}})}. \end{aligned}$$
(185)

Similarly, we have

$$\begin{aligned} N^{\delta -\lambda ^*}{p_0q_0}< & {} \frac{\varPhi (v)}{{\varPsi }^{{\mathcal {B}}}(v)} \end{aligned}$$
(186)
$$\begin{aligned} \rightarrow d\le & {} \frac{{(\lambda ^*-\delta )}\log N}{\log (\max _{i,j}\frac{p_{ij}}{p_j^{{\mathcal {B}}}})}. \end{aligned}$$
(187)

Thus, \(c_d\) is defined as \( \min \big (\frac{{(\lambda ^*-1)}}{\log (\max _{i,j}\frac{p_{ij}}{p_i^{{\mathcal {A}}}})},\frac{{(\lambda ^*-\delta )}}{\log (\max _{i,j}\frac{p_{ij}}{p_j^{{\mathcal {B}}}})}\big )\). (185) and (187) are true as \(p_0,q_0\le 1\).

Appendix 6: Pseudo code

Here, we present the pseudo code for computing the complexity of the algorithm in Dubiner (2012) in the case of Hamming distance (see Experiment 1).

(The pseudo code appears as figure f in the published version of the article.)

Appendix 7: Further discussion on MIPS

In order to use MIPS to solve this problem, i.e., (2), we need to derive the optimal weights \(\omega _{ij}\) that minimize the norm \(M^2\) in Shrivastava and Li (2014). The term M stands for the radius of the space, which is computed as \(M^2={\mathbb {E}}{\big (||x||\big )}^2+{\mathbb {E}}{\big (||y||\big )}^2\). Therefore, from (69)–(72) we conclude that \(M^2=\sum _{ij}\left( p_j^{{\mathcal {B}}}\omega _{ij}^2+\frac{p_i^{{\mathcal {A}}}{\log ^2(\frac{p_{ij}}{q_{ij}})}}{\omega _{ij}^2}\right) \), which is minimized by \(\omega _{ij}={\big (\frac{p_i^{{\mathcal {A}}}}{p_j^{{\mathcal {B}}}}\big )}^{0.25}{\big (\mid \log \frac{p_{ij}}{q_{ij}}\mid \big )}^{0.5}\). On the other hand, for \((x,y)\sim Q(x,y)\) we have

$$\begin{aligned} {\mathbb {E}}\big (||x||||y||\big )\ge {\mathbb {E}}(<T(x),T(y)>) =S\sum _{ij}q_{ij}{\mid \log (\frac{p_{ij}}{q_{ij}})\mid }. \end{aligned}$$
(188)

In order to achieve a true positive rate close to one together with sub-quadratic complexity, we need \(S_0\le S\, d_{KL}(p_{ij}||q_{ij})\) and \(cS_0\ge -S\, d_{KL}(q_{ij}||p_{ij})\), where \(d_{KL}\) stands for the Kullback–Leibler divergence. Moreover, we should have \(M^2\ge S\sum _{ij}\sqrt{q_{ij}}|\log (\frac{p_{ij}}{q_{ij}})|\). Setting \(c=0\) and choosing \(S_0\) and M as above, the resulting complexity exponent exceeds 1.9 for any \(2\times 2\) probability distribution matrix, i.e., the complexity is nearly quadratic. The reason is that the transformed data points are nearly orthogonal to each other, which makes it very slow to find the maximum inner product using the existing method (Shrivastava and Li 2014).
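
To make the weight derivation above concrete, the following sketch evaluates \(\omega _{ij}\) and the resulting \(M^2\) for a small joint distribution. The matrix used below is a hypothetical example (not taken from the paper); entries with \(p_{ij}=0\) or \(p_{ij}=q_{ij}\) would require special handling and are avoided here.

```python
import numpy as np

# Hypothetical 2x2 joint distribution p(x, y) with all entries nonzero and p_ij != q_ij.
P = np.array([[0.30, 0.10],
              [0.20, 0.40]])

pA = P.sum(axis=1)            # marginal over rows,    p_i^A
pB = P.sum(axis=0)            # marginal over columns, p_j^B
Q = np.outer(pA, pB)          # q_ij = p_i^A * p_j^B

log_ratio = np.abs(np.log(P / Q))

# Optimal weights omega_ij = (p_i^A / p_j^B)^(1/4) * |log(p_ij / q_ij)|^(1/2)
omega = (pA[:, None] / pB[None, :]) ** 0.25 * np.sqrt(log_ratio)

# Resulting M^2 = sum_ij ( p_j^B omega_ij^2 + p_i^A log^2(p_ij/q_ij) / omega_ij^2 )
M2 = np.sum(pB[None, :] * omega ** 2 + pA[:, None] * log_ratio ** 2 / omega ** 2)

print("omega =", omega)
print("M^2   =", M2)
```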

Appendix 8: Complexities of MinHash, LSH-hamming and ForestDSH

In this section, we derive the complexities of MinHash, LSH-hamming and ForestDSH for \({P}_1=\begin{bmatrix} 0.345 &{} 0 \\ 0.31 &{} 0.345 \end{bmatrix}\). The complexities for any other \(2\times 2\) probability distribution can be computed similarly.

1.1 Appendix 8.1: Complexity of MinHash

For MinHash the query complexity is

$$\begin{aligned} N^{\min (mh_1,mh_2,mh_3,mh_4)}, \end{aligned}$$
(189)

where \(mh_1={\frac{\log \frac{p_{00}}{1-p_{11}}}{\log \frac{q_{00}}{1-q_{11}}}}\), \(mh_2={\frac{\log \frac{p_{01}}{1-p_{10}}}{\log \frac{q_{01}}{1-q_{10}}}}\), \(mh_3={\frac{\log \frac{p_{10}}{1-p_{01}}}{\log \frac{q_{10}}{1-q_{01}}}}\) and \(mh_4={\frac{\log \frac{p_{11}}{1-p_{00}}}{\log \frac{q_{11}}{1-q_{00}}}}\). For \({P}_1\), the per-query complexity is \(O(N^{0.5207})\).
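
As a sanity check, the exponent in (189) can be evaluated numerically for \(P_1\). The sketch below does so with \(q_{ij}=p_i^{{\mathcal {A}}}p_j^{{\mathcal {B}}}\); the candidate \(mh_2\) becomes infinite here because \(p_{01}=0\), so it does not affect the minimum.

```python
import numpy as np

P = np.array([[0.345, 0.0],
              [0.31, 0.345]])
pA, pB = P.sum(axis=1), P.sum(axis=0)
Q = np.outer(pA, pB)                      # q_ij = p_i^A * p_j^B

def ratio(p_num, p_den, q_num, q_den):
    # Candidate exponent mh = log(p_num/(1-p_den)) / log(q_num/(1-q_den)).
    # A zero numerator probability gives log(0) = -inf, i.e., an infinite (inactive) candidate.
    with np.errstate(divide="ignore"):
        return np.log(p_num / (1 - p_den)) / np.log(q_num / (1 - q_den))

mh = [ratio(P[0, 0], P[1, 1], Q[0, 0], Q[1, 1]),   # mh_1
      ratio(P[0, 1], P[1, 0], Q[0, 1], Q[1, 0]),   # mh_2 (infinite here since p_01 = 0)
      ratio(P[1, 0], P[0, 1], Q[1, 0], Q[0, 1]),   # mh_3
      ratio(P[1, 1], P[0, 0], Q[1, 1], Q[0, 0])]   # mh_4

print(min(mh))   # approximately 0.5207 for P_1
```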

1.2 Appendix 8.2: Complexity of LSH-hamming

In the case of LSH-hamming, the query complexity is

$$\begin{aligned} O\left( N^{\min \left( \frac{\log (p_{00}+p_{11})}{\log (q_{00}+q_{11})}, \frac{\log (p_{01}+p_{10})}{\log (q_{01}+q_{10})}\right) }\right) , \end{aligned}$$
(190)

and the required storage is \(O(N^{1+\min (\frac{\log (p_{00}+p_{11})}{\log (q_{00}+q_{11})},\frac{\log (p_{01}+p_{10})}{\log (q_{01}+q_{10})})})\). For \({P}_1\), the per-query complexity is \(O(N^{0.4672})\).
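
The exponent in (190) can be checked numerically for \(P_1\) in the same way (again a sketch, with \(q_{ij}=p_i^{{\mathcal {A}}}p_j^{{\mathcal {B}}}\)):

```python
import numpy as np

P = np.array([[0.345, 0.0],
              [0.31, 0.345]])
Q = np.outer(P.sum(axis=1), P.sum(axis=0))   # q_ij = p_i^A * p_j^B

# Exponent from (190): the minimum over the "agree" and "disagree" terms.
rho_agree = np.log(P[0, 0] + P[1, 1]) / np.log(Q[0, 0] + Q[1, 1])
rho_disagree = np.log(P[0, 1] + P[1, 0]) / np.log(Q[0, 1] + Q[1, 0])

print(min(rho_agree, rho_disagree))   # approximately 0.4672 for P_1
```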

1.3 Appendix 8.3: Complexity of ForestDSH

From Definition 4, we derive \(\lambda ^*\) as follows

$$\begin{aligned} (\mu ^*,\nu ^*,\eta ^*)= & {} \arg \max _{\min (\mu ,\nu )\ge \eta >0,\sum _{i,j}p_{ij} ^ {1+\mu +\nu -\eta } {(p_{i}^{{\mathcal {A}}})} ^ {-\mu } {(p_j^{{\mathcal {B}}})} ^ {-\nu } = 1}\frac{1+\mu +\nu }{1+\mu +\nu -\eta } \end{aligned}$$
(191)
$$\begin{aligned}= & {} ( 4.6611,4.6611, 3.1462) \end{aligned}$$
(192)
$$\begin{aligned} \lambda ^*= & {} \frac{1+\mu ^*+\nu ^*}{1+\mu ^*+\nu ^*-\eta ^*} \end{aligned}$$
(193)
$$\begin{aligned}= & {} 1.4384. \end{aligned}$$
(194)

Note that \(\delta =1\), so the per-query complexity is \(O(N^{\lambda ^*-1})=O(N^{0.4384})\).
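
The optimization in (191) can also be solved numerically. The sketch below uses scipy.optimize.minimize with SLSQP (an assumed dependency, not part of the paper); the result is sensitive to the initial point, so it should be read as an illustration of the definition rather than a robust solver. For \(P_1\) it should recover values close to those in (192)–(194).

```python
import numpy as np
from scipy.optimize import minimize

P = np.array([[0.345, 0.0],
              [0.31, 0.345]])
pA, pB = P.sum(axis=1), P.sum(axis=0)

def constraint_eq(v):
    mu, nu, eta = v
    # sum_ij p_ij^(1+mu+nu-eta) (p_i^A)^(-mu) (p_j^B)^(-nu) = 1, with the convention 0^positive = 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(P > 0,
                         P ** (1 + mu + nu - eta) * pA[:, None] ** (-mu) * pB[None, :] ** (-nu),
                         0.0)
    return terms.sum() - 1.0

def neg_lambda(v):
    mu, nu, eta = v
    return -(1 + mu + nu) / (1 + mu + nu - eta)

cons = [{"type": "eq", "fun": constraint_eq},
        {"type": "ineq", "fun": lambda v: v[0] - v[2]},    # mu >= eta
        {"type": "ineq", "fun": lambda v: v[1] - v[2]},    # nu >= eta
        {"type": "ineq", "fun": lambda v: v[2] - 1e-6}]    # eta > 0 (approximated)

res = minimize(neg_lambda, x0=np.array([2.0, 2.0, 1.0]), constraints=cons, method="SLSQP")
print("mu*, nu*, eta* ~", res.x)        # expected to be near (4.66, 4.66, 3.15) for P_1
print("lambda*        ~", -res.fun)     # expected to be near 1.438 for P_1
```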

Appendix 9: Joint probability distributions learned on mass spectrometry data

The mass spectrometry data for Experiment 4 are shown in Fig. 10a–c for \(\log Rank\) at base 4 (a \(4\times 4\) matrix), \(\log Rank\) at base 2 (an \(8\times 8\) matrix), and no \(\log Rank\) transformation (a \(51\times 51\) matrix). For the mass spectrometry data shown in Fig. 10a, the probability distribution \(P_{4\times 4}\) is given in (195). Note that, in the case of LSH-hamming, the per-query complexity exponents for these \(4\times 4\), \(8\times 8\) and \(51\times 51\) matrices are 0.901, 0.890 and 0.905, respectively. Similarly, the per-query complexity exponents for MinHash are 0.4425, 0.376 and 0.386, respectively.

For the mass spectrometry data shown in Fig. 10a, b, the probability distributions \(p(x,y)\) are represented as

$$\begin{aligned} P_{4\times 4}= & {} \begin{bmatrix} 0.000125&{} 5.008081\times {10}^{-5}&{} 9.689274\times {10}^{-8}&{} 0.000404\\ 5.008082\times {10}^{-5}&{} 0.000209&{} 6.205379\times {10}^{-6}&{} 0.001921\\ 9.689274\times {10}^{-8}&{} 6.205379\times {10}^{-6}&{} 2.688879\times {10}^{-5}&{} 0.000355\\ 0.000404&{} 0.001921&{} 0.000355&{} 0.994165 \end{bmatrix}, \end{aligned}$$
(195)
$$\begin{aligned} P_{8\times 8}= & {} \begin{bmatrix} 3.458\times {10}^{-5}&{} 1.442\times {10}^{-5}&{} 5.434\times {10}^{-6}&{} 1.723\times {10}^{-6}&{} 2.920\times {10}^{-7}&{} 7.496\times {10}^{-8}&{} 6.718\times {10}^{-8}&{} 5.023\times {10}^{-5}\\ 1.442\times {10}^{-5}&{} 3.708\times {10}^{-5}&{} 2.550\times {10}^{-5}&{} 8.706\times {10}^{-6}&{} 1.561\times {10}^{-6}&{} 4.809\times {10}^{-7}&{} 2.680\times {10}^{-7}&{} 1.575\times {10}^{-4}\\ 5.434\times {10}^{-6}&{} 2.550\times {10}^{-5}&{} 3.907\times {10}^{-5}&{} 2.948\times {10}^{-5}&{} 6.442\times {10}^{-6}&{} 2.008\times {10}^{-6}&{} 1.251\times {10}^{-6}&{} 3.672\times {10}^{-4}\\ 1.723\times {10}^{-6}&{} 8.706\times {10}^{-6}&{} 2.948\times {10}^{-5}&{} 4.867\times {10}^{-5}&{} 1.813\times {10}^{-5}&{} 6.098\times {10}^{-6}&{} 4.532\times {10}^{-6}&{} 5.539\times {10}^{-4}\\ 2.921\times {10}^{-7}&{} 1.561\times {10}^{-6}&{} 6.442\times {10}^{-6}&{} 1.813\times {10}^{-5}&{} 2.887\times {10}^{-5}&{} 6.892\times {10}^{-6}&{} 5.309\times {10}^{-6}&{} 4.138\times {10}^{-4}\\ 7.496\times {10}^{-8}&{} 4.809\times {10}^{-7}&{} 2.008\times {10}^{-6}&{} 6.098\times {10}^{-6}&{} 6.892\times {10}^{-6}&{} 2.123\times {10}^{-5}&{} 5.826\times {10}^{-6}&{} 3.246\times {10}^{-4}\\ 6.718\times {10}^{-8}&{} 2.680\times {10}^{-7}&{} 1.251\times {10}^{-6}&{} 4.531\times {10}^{-6}&{} 5.309\times {10}^{-6}&{} 5.826\times {10}^{-6}&{} 6.411\times {10}^{-5}&{} 8.364\times {10}^{-4}\\ 5.023\times {10}^{-5}&{} 1.574\times {10}^{-4}&{} 3.671\times {10}^{-4}&{} 5.539\times {10}^{-4}&{} 4.138\times {10}^{-4}&{} 3.246\times {10}^{-4}&{} 8.364\times {10}^{-4}&{} 0.994 \end{bmatrix}. \end{aligned}$$
(196)

From (56), for \(P_{4\times 4}\), \((\mu ^*,\nu ^*,\eta ^*,\lambda ^*)\) are derived as

$$\begin{aligned} \mu ^*= & {} 1.151016, \end{aligned}$$
(198)
$$\begin{aligned} \nu ^*= & {} 1.151016, \end{aligned}$$
(199)
$$\begin{aligned} \eta ^*= & {} 0.813168, \end{aligned}$$
(200)
$$\begin{aligned} \lambda ^*= & {} 1.326723. \end{aligned}$$
(201)

Similarly, for \(P_{8\times 8}\), we have

$$\begin{aligned} \mu ^*= & {} 0.871147, \end{aligned}$$
(202)
$$\begin{aligned} \nu ^*= & {} 0.871147, \end{aligned}$$
(203)
$$\begin{aligned} \eta ^*= & {} 0.624426, \end{aligned}$$
(204)
$$\begin{aligned} \lambda ^*= & {} 1.294837. \end{aligned}$$
(205)

For the mass spectrometry data shown in Fig. 10c, \((\mu ^*,\nu ^*,\eta ^*,\lambda ^*)\) are

$$\begin{aligned} \mu ^*= & {} 0.901208, \end{aligned}$$
(206)
$$\begin{aligned} \nu ^*= & {} 0.901208, \end{aligned}$$
(207)
$$\begin{aligned} \eta ^*= & {} 0.615797,\end{aligned}$$
(208)
$$\begin{aligned} \lambda ^*= & {} 1.281621. \end{aligned}$$
(209)
