Abstract
We consider a subset matching variant of the Dictionary Query problem. Consider a dictionary D of n strings, where each string location contains a set of characters drawn from some alphabet Σ = {1,...,|Σ|}. Our goal is to preprocess D so that, given a query pattern p in which each location contains a single character from Σ, we can answer whether p matches D. We say that p matches D if there is some s ∈ D such that |p| = |s| and p[i] ∈ s[i] for every 1 ≤ i ≤ |p|.
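As a reading aid, the matching definition above can be checked directly with no preprocessing. The sketch below is only a baseline that restates the definition, not the indexing scheme of the chapter; the names `matches` and `query_naive`, and the choice to represent each dictionary string as a list of Python sets, are assumptions made for illustration.

```python
from typing import List, Sequence, Set

# A dictionary string is modelled as a list of character sets
# (hypothetical representation chosen for this sketch).
SetString = List[Set[int]]


def matches(p: Sequence[int], s: SetString) -> bool:
    """Return True if |p| = |s| and p[i] is in s[i] for every position i."""
    return len(p) == len(s) and all(c in chars for c, chars in zip(p, s))


def query_naive(D: List[SetString], p: Sequence[int]) -> bool:
    """Scan every string in D; p matches D if it matches at least one s in D."""
    return any(matches(p, s) for s in D)


# Example dictionary over the alphabet {1, 2}.
D = [
    [{1}, {1, 2}, {2}],  # matches 1,1,2 and 1,2,2
    [{2}, {2}],          # matches 2,2 only
]
found = query_naive(D, [1, 2, 2])    # second position may be 1 or 2
missing = query_naive(D, [2, 1, 2])  # first position must be 1
```

This scan inspects every dictionary string, so a query costs time proportional to the total size of D rather than to |p| alone, which is the cost that preprocessing is meant to avoid.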
To achieve a query time of O(|p|), we construct a compressed trie of all possible patterns that appear in D. Assuming that for every s ∈ D there are at most k locations where |s[i]| > 1, we present two constructions of the trie that yield a preprocessing time of O(nm + |Σ|^k n log(min{n,m})), where n is the number of strings in D and m is the maximum length of a string in D. The first construction is based on divide and conquer and the second construction uses ideas introduced in [2] for text fingerprinting. Furthermore, we show how to obtain O(nm + |Σ|^k n + |Σ|^(k/2) n log(min{n,m})) preprocessing time and O(|p| log log |Σ| + min{|p|, log(|Σ|^k n)} log log(|Σ|^k n)) query time by cutting the dictionary strings and constructing two compressed tries.
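To make the idea of indexing all possible patterns concrete, here is a minimal sketch that expands every dictionary string into the patterns it can match and inserts them into a plain, uncompressed trie. It is not either of the two constructions described above: it uses neither divide and conquer nor fingerprinting, does not compress unary paths, and spends time proportional to the total length of all expanded patterns. The names `build_trie`, `query_trie`, and `END` are hypothetical, and child lookup relies on Python dicts, so the O(|p|) query bound holds only in the expected sense.

```python
from itertools import product
from typing import Dict, List, Sequence, Set

END = object()  # sentinel key marking the end of a complete pattern


def build_trie(D: List[List[Set[int]]]) -> Dict:
    """Insert every pattern matched by some string of D into a nested-dict trie.

    A string with at most k positions holding more than one character
    expands into at most |alphabet|**k patterns.
    """
    root: Dict = {}
    for s in D:
        # product(*s) enumerates one character choice per position of s.
        for pattern in product(*s):
            node = root
            for c in pattern:
                node = node.setdefault(c, {})
            node[END] = True
    return root


def query_trie(root: Dict, p: Sequence[int]) -> bool:
    """Walk the trie along p; p matches D iff the walk ends on a marked node."""
    node = root
    for c in p:
        node = node.get(c)
        if node is None:
            return False
    return END in node


D = [
    [{1}, {1, 2}, {2}],
    [{2}, {2}],
]
trie = build_trie(D)
hit = query_trie(trie, [1, 1, 2])
miss = query_trie(trie, [1, 2])  # proper prefix of a pattern, not a full match
```

The end marker is what enforces the |p| = |s| condition: a query that stops at an internal node is a prefix of some pattern but does not match any dictionary string.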
Our problem is motivated by haplotype inference from a library of genotypes [13,16]. There, D is a known library of genotypes (|Σ| = 2), and p is a haplotype. Indexing all possible haplotypes that can be inferred from D as well as gathering statistical information about them can be used to accelerate various haplotype inference algorithms.
References
Abecasis, G.R., Martin, R., Lewitzky, S.: Estimation of haplotype frequencies from diploid data. American Journal of Human Genetics 69(4 Suppl. 1), 114 (2001)
Amir, A., Apostolico, A., Landau, G.M., Satta, G.: Efficient text fingerprinting via Parikh mapping. J. of Discrete Algorithms 1(5-6), 409–421 (2003)
Chazelle, B., Guibas, L.J.: Fractional cascading: I. A data structuring technique. Algorithmica 1(2), 133–162 (1986)
Clark, A.G.: Inference of haplotypes from PCR-amplified samples of diploid population. Molecular Biology and Evolution 7(2), 111–122 (1990)
Cole, R., Gottlieb, L., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proc. 36th ACM Symposium on Theory of Computing (STOC), pp. 91–100 (2004)
Cole, R., Hariharan, R.: Verifying candidate matches in sparse and wildcard matching. In: Proc. 34th ACM Symposium on Theory of Computing (STOC), pp. 592–601 (2002)
Cole, R., Kopelowitz, T., Lewenstein, M.: Suffix trays and suffix trists: structures for faster text indexing. In: Bugliesi, M., Preneel, B., Sassone, V., Wegener, I. (eds.) ICALP 2006. LNCS, vol. 4051, pp. 358–369. Springer, Heidelberg (2006)
Didier, G., Schmidt, T., Stoye, J., Tsur, D.: Character sets of strings. J. of Discrete Algorithms 5(2), 330–340 (2007)
Excoffier, L., Slatkin, M.: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Molecular Biology and Evolution 12(5), 921–927 (1995)
Fallin, D., Schork, N.J.: Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. American Journal of Human Genetics 67(4), 947–959 (2000)
Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. J. of the ACM 47(6), 987–1011 (2000)
Gusfield, D.: Haplotype inference by pure parsimony. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds.) CPM 2003. LNCS, vol. 2676, pp. 144–155. Springer, Heidelberg (2003)
Gusfield, D., Orzack, S.H.: Haplotype inference. In: Aluru, S. (ed.) CRC handbook on bioinformatics (2005)
Hagerup, T., Miltersen, P.B., Pagh, R.: Deterministic dictionaries. J. of Algorithms 41(1), 69–85 (2001)
Hajiaghayi, M.T., Jain, K., Konwar, K., Lau, L.C., Mandoiu, I.I., Vazirani, V.V.: Minimum multicolored subgraph problem in multiplex PCR primer set selection and population haplotyping. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3992, pp. 758–766. Springer, Heidelberg (2006)
Halldórsson, B.V., Bafna, V., Edwards, N., Lippert, R., Yooseph, S., Istrail, S.: A survey of computational methods for determining haplotypes. In: Istrail, S., Waterman, M.S., Clark, A. (eds.) DIMACS/RECOMB Satellite Workshop 2002. LNCS (LNBI), vol. 2983, pp. 26–47. Springer, Heidelberg (2002)
Halperin, E., Karp, R.M.: The minimum-entropy set cover problem. In: Díaz, J., Karhumäki, J., Lepistö, A., Sannella, D. (eds.) ICALP 2004. LNCS, vol. 3142, pp. 733–744. Springer, Heidelberg (2004)
Harel, D., Tarjan, R.E.: Fast algorithms for finding nearest common ancestors. SIAM J. on Computing 13(2), 338–355 (1984)
Hawley, M.E., Kidd, K.K.: Haplo: A program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J. of Heredity 86, 409–411 (1995)
Helmuth, L.: Genome research: Map of human genome 3.0. Science 293(5530), 583–585 (2001)
Indyk, P.: Faster algorithms for string matching problems: Matching the convolution bound. In: Proc. 39th Symposium on Foundations of Computer Science (FOCS), pp. 166–173 (1998)
Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. of the ACM 53(6), 918–936 (2006)
Kolpakov, R., Raffinot, M.: New algorithms for text fingerprinting. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 342–353. Springer, Heidelberg (2006)
Long, J.C., Williams, R.C., Urbanek, M.: An E-M algorithm and testing strategy for multiple-locus haplotypes. American Journal of Human Genetics 56(2), 799–810 (1995)
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. of the ACM 23, 262–272 (1976)
Rastas, P., Koivisto, M., Mannila, H., Ukkonen, E.: A hidden Markov technique for haplotype reconstruction. In: Casadio, R., Myers, G. (eds.) WABI 2005. LNCS (LNBI), vol. 3692, pp. 140–151. Springer, Heidelberg (2005)
Rastas, P., Ukkonen, E.: Haplotype inference via hierarchical genotype parsing. In: Giancarlo, R., Hannenhalli, S. (eds.) WABI 2007. LNCS (LNBI), vol. 4645, pp. 85–97. Springer, Heidelberg (2007)
Shi, Q., JáJá, J.: Novel transformation techniques using Q-heaps with applications to computational geometry. SIAM J. on Computing 34(6), 1471–1492 (2005)
Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 246–260 (1995)
van Emde Boas, P.: Preserving order in a forest in less than logarithmic time and linear space. Information Processing Letters 6(3), 80–82 (1977)
Weiner, P.: Linear pattern matching algorithms. In: Proc. 14th IEEE Symposium on Switching and Automata Theory, pp. 1–11 (1973)
Zhang, P., Sheng, H., Morabia, A., Gilliam, T.C.: Optimal step length EM algorithm (OSLEM) for the estimation of haplotype frequency and its application in lipoprotein lipase genotyping. BMC Bioinformatics 4(3) (2003)
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Landau, G.M., Tsur, D., Weimann, O. (2010). Indexing a Dictionary for Subset Matching Queries. In: Elomaa, T., Mannila, H., Orponen, P. (eds) Algorithms and Applications. Lecture Notes in Computer Science, vol 6060. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12476-1_11
DOI: https://doi.org/10.1007/978-3-642-12476-1_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12475-4
Online ISBN: 978-3-642-12476-1
eBook Packages: Computer Science, Computer Science (R0)