Longest Common Prefixes with k-Errors and Applications

Ayad, Lorraine A. K.; Barton, Carl; Charalampopoulos, Panagiotis; Iliopoulos, Costas S.; Pissis, Solon P.

doi:10.1007/978-3-030-00479-8_3

Lorraine A. K. Ayad¹⁷,
Carl Barton¹⁸,
Panagiotis Charalampopoulos¹⁷,
Costas S. Iliopoulos¹⁷ &
…
Solon P. Pissis¹⁷

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 11147))

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

830 Accesses
7 Citations

Abstract

Although real-world text datasets, such as DNA sequences, are far from being uniformly random, string searching average-case algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length n that occurs elsewhere in the string with k-errors. This problem has already been studied under the Hamming distance model. Our first result is an improvement upon the state-of-the-art average-case time complexity for non-constant k and using only linear space under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in $\mathcal {O}(n\frac{(c\log n)^k}{k!})$ time on average, where $c>1$ is a constant, using $\mathcal {O}(n)$ space. Finally, we show that our technique is applicable to several algorithmic problems found in computational biology and elsewhere. The importance of our technique lies on the fact that it is the first one achieving this bound for non-constant k and using $\mathcal {O}(n)$ space.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
The reason for imposing “$\alpha >4$” instead of “$\alpha >3$” becomes clear in Sect. 3.3.

References

Abboud, A., Williams, R., Yu, H.: More applications of the polynomial method to algorithm design. In: SODA, SODA 2015, pp. 218–230. Society for Industrial and Applied Mathematics (2015)
Google Scholar
Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discret. Algorithms 2(1), 53–86 (2004)
Article MathSciNet Google Scholar
Alamro, H., Ayad, L.A.K., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P.: Longest common prefixes with k-mismatches and applications. In: Tjoa, A.M., Bellatreche, L., Biffl, S., van Leeuwen, J., Wiedermann, J. (eds.) SOFSEM 2018. LNCS, vol. 10706, pp. 636–649. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73117-9_45
Chapter Google Scholar
Alzamel, M., et al.: Efficient computation of sequence mappability. In: Gagie, T., et al. (eds.) SPIRE 2018. LNCS, vol. 11147, pp. 12–26. Springer, Cham (2018)
Google Scholar
Alzamel, M., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P., Radoszewski, J., Sung, W.-K.: Faster algorithms for 1-mappability of a sequence. In: Gao, X., Du, H., Han, M. (eds.) COCOA 2017. LNCS, vol. 10628, pp. 109–121. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-71147-8_8
Chapter Google Scholar
Apostolico, A., Guerra, C., Landau, G.M., Pizzi, C.: Sequence similarity measures based on bounded hamming distance. Theor. Comput. Sci. 638, 76–90 (2016). Pattern Matching, Text Data Structures and Compression
Article MathSciNet Google Scholar
Apostolico, A., Guerra, C., Pizzi, C.: Alignment free sequence similarity with bounded hamming distance. In: DCC, pp. 183–192. IEEE (2014)
Google Scholar
Bender, M.A., Farach-Colton, M.: The LCA problem revisited. In: Gonnet, G.H., Viola, A. (eds.) LATIN 2000. LNCS, vol. 1776, pp. 88–94. Springer, Heidelberg (2000). https://doi.org/10.1007/10719839_9
Chapter Google Scholar
Bollobás, B., Letzter, S.: Longest common extension. Eur. J. Comb. 68, 242–248 (2018)
Article MathSciNet Google Scholar
Charalampopoulos, P., et al.: Linear-time algorithm for long LCF with $k$ mismatches. In: CPM. LIPIcs, vol. 105, pp. 23:1–23:16. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2018)
Google Scholar
Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: STOC, STOC 2004, pp. 91–100. ACM (2004)
Google Scholar
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009)
MATH Google Scholar
Derrien, T., et al.: Fast computation and applications of genome mappability. PLoS ONE 7(1), e30377 (2012)
Article Google Scholar
Eades, P., McKay, B.D.: An algorithm for generating subsets of fixed size with a strong minimal change property. Inf. Process. Lett. 19(3), 131–133 (1984)
Article MathSciNet Google Scholar
Farach, M.: Optimal suffix tree construction with large alphabets. In: FOCS, pp. 137–143. IEEE Computer Society (1997)
Google Scholar
Faro, S., Lecroq, T.: The exact online string matching problem: a review of the most recent results. ACM Comput. Surv 45(2), 13:1–13:42 (2013)
Article Google Scholar
Fischer, J.: Inducing the LCP-array. In: Dehne, F., Iacono, J., Sack, J.-R. (eds.) WADS 2011. LNCS, vol. 6844, pp. 374–385. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22300-6_32
Chapter Google Scholar
Flouri, T., Giaquinta, E., Kobert, K., Ukkonen, E.: Longest common substrings with $k$ mismatches. Inf. Process. Lett. 115(6–8), 643–647 (2015)
Article MathSciNet Google Scholar
Grabowski, S.: A note on the longest common substring with $k$-mismatches problem. Inf. Process. Lett. 115(6–8), 640–642 (2015)
Article MathSciNet Google Scholar
Horwege, S., et al.: Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res. 42(Webserver-Issue), 7–11 (2014)
Google Scholar
Karlin, S., Ghandour, G., Ost, F., T, S., Korn, L.J.: New approaches for computer analysis of nucleic acid sequences. Proc. Natl. Acad. Sci. USA 80, 5660–5664 (1983)
Article Google Scholar
Kociumaka, T., Radoszewski, J., Starikovskaya, T.A.: Longest common substring with approximately $k$ mismatches. CoRR, abs/1712.08573 (2017)
Google Scholar
Kolpakov, R., Bana, G., Kucherov, G.: MREPS: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res. 31(13), 3672–3678 (2003)
Article Google Scholar
Kucherov, G., Tsur, D.: Improved filters for the approximate suffix-prefix overlap problem. In: Moura, E., Crochemore, M. (eds.) SPIRE 2014. LNCS, vol. 8799, pp. 139–148. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11918-2_14
Chapter Google Scholar
Leimeister, C., Morgenstern, B.: Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics 30(14), 2000–2008 (2014)
Article Google Scholar
Liang, K.-H.: Bioinformatics for Biomedical Science and Clinical Applications. Woodhead Publishing Series in Biomedicine. Woodhead Publishing (2013)
Google Scholar
Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Article MathSciNet Google Scholar
Manzini, G.: Longest common prefix with mismatches. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds.) SPIRE 2015. LNCS, vol. 9309, pp. 299–310. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23826-5_29
Chapter Google Scholar
Navarro, G., Baeza-Yates, R.A.: A hybrid indexing method for approximate string matching. J. Discret. Algorithms 1(1), 21–49 (2000)
MathSciNet Google Scholar
Nong, G., Zhang, S., Chan, W.H.: Linear suffix array construction by almost pure induced-sorting. In: DCC, pp. 193–202. IEEE (2009)
Google Scholar
Pizzi, C.: Missmax: alignment-free sequence comparison with mismatches through filtering and heuristics. Algorithms Mol. Biol. 11(1), 6 (2016)
Article Google Scholar
Rasmussen, K.R., Stoye, J., Myers, E.W.: Efficient $q$-gram filters for finding all epsilon-matches over a given length. J. Comput. Biol. 13(2), 296–308 (2006)
Article MathSciNet Google Scholar
Smit, A.F.: Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev. 9(6), 657–663 (1999)
Article Google Scholar
Thankachan, S.V., Aluru, C., Chockalingam, S.P., Aluru, S.: Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In: Raphael, B.J. (ed.) RECOMB 2018. LNCS, vol. 10812, pp. 211–224. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89929-9_14
Chapter Google Scholar
Thankachan, S.V., Apostolico, A., Aluru, S.: A provably efficient algorithm for the k-mismatch average common substring problem. J. Comput. Biol. 23(6), 472–482 (2016)
Article MathSciNet Google Scholar
Thankachan, S.V., Chockalingam, S.P., Liu, Y., Apostolico, A., Aluru, S.: ALFRED: a practical method for alignment-free distance computation. J. Comput. Biol., 23(6), 452–460 (2016)
Article MathSciNet Google Scholar
Ulitsky, I., Burstein, D., Tuller, T., Chor, B.: The average common substring approach to phylogenomic reconstruction. J. Comput. Biol. 13(2), 336–350 (2006)
Article MathSciNet Google Scholar
Välimäki, N., Ladra, S., Mäkinen, V.: Approximate all-pairs suffix/prefix overlaps. Inf. Comput. 213, 49–58 (2012)
Article MathSciNet Google Scholar
Willard, D.E.: Log-logarithmic worst-case range queries are possible in space theta(n). Inf. Process. Lett. 17(2), 81–84 (1983)
Article Google Scholar

Download references

Acknowledgements

We warmly thank Cyril Nicaud (Université Paris-Est) for useful discussions.

Author information

Authors and Affiliations

Department of Informatics, King’s College London, London, UK
Lorraine A. K. Ayad, Panagiotis Charalampopoulos, Costas S. Iliopoulos & Solon P. Pissis
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, UK
Carl Barton

Authors

Lorraine A. K. Ayad
View author publications
You can also search for this author in PubMed Google Scholar
Carl Barton
View author publications
You can also search for this author in PubMed Google Scholar
Panagiotis Charalampopoulos
View author publications
You can also search for this author in PubMed Google Scholar
Costas S. Iliopoulos
View author publications
You can also search for this author in PubMed Google Scholar
Solon P. Pissis
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Panagiotis Charalampopoulos .

Editor information

Editors and Affiliations

Diego Portales University, Santiago, Chile
Travis Gagie
The University of Melbourne, Melbourne, VIC, Australia
Alistair Moffat
University of Chile, Santiago, Chile
Gonzalo Navarro
Universidad de Ingeniería y Tecnología, Lima, Peru
Ernesto Cuadros-Vargas

A K-Combinations Generation Process

We generate all combinations of size K of the set [N] using the following folklore algorithm. We build a tree in a recursive way. Each node v has a label $\ell (v) \in [N]$, apart from the root which is labeled with 0. We say that a node v represents the set $r(v)=\{\ell (v)\} \cup \{\ell (u)|u\text { is an ancestor of }v\}\setminus \{0\}$. By construction, no two nodes will represent the same set. At the end of the process each of the $\mathcal {O}(\genfrac(){0.0pt}1{N}{K})$ leaves will represent a distinct K-combination. The procedure works as follows:

1.
It takes as input a node v with satellite data $\ell (v)$ and |r(v)|.
2.
It creates $N-\ell (v)-K+|r(v)|$ child nodes of v, labeled $\ell (v)+1$ to $N-K+|r(v)|+1$.
3.
For each newly created node u, if $|r(u)|=|r(v)|+1<K$, the procedure recursively calls itself.

Initiating this procedure with an input node u, with $\ell (u)=0$ and $|r(u)|=0$, all K-combinations are generated in time $\mathcal {O}(\genfrac(){0.0pt}1{N}{K})$ if $K\le N/2$ as shown below. Note that each node of the tree is associated with a unique subset of [N] and neighbouring nodes only differ on whether they contain the largest element of their union.

To upper bound the time complexity, it suffices to bound the number of nodes in the tree with a single child, since the rest internal nodes are trivially upper bounded by the number of leaves. A node with label $N-i$ will have a single child only if it is the $(K-i)$th element added in a combination. This can happen $\mathcal {O}(\genfrac(){0.0pt}1{N-i-1}{K-i-1})$ times. Given our assumption that $K\le N/2$ and using Pascal’s identity $\genfrac(){0.0pt}1{m}{j}=\genfrac(){0.0pt}1{m-1}{j-1}+\genfrac(){0.0pt}1{m-1}{j}$, we bound the number of nodes with a single child as follows:

$$\begin{aligned}&\sum _{i=1}^{K-1}{\genfrac(){0.0pt}1{N-i-1}{K-i-1}}= 1+\sum _{i=1}^{K-2}{\genfrac(){0.0pt}1{N-i}{K-i-1}}-\sum _{i=1}^{K-2}{\genfrac(){0.0pt}1{N-i-1}{K-i-2}}\\&=1+\sum _{i=1}^{K-2}{\genfrac(){0.0pt}1{N-i}{K-i-1}}-\sum _{i=2}^{K-1}{\genfrac(){0.0pt}1{N-i}{K-i-1}}= 1+\genfrac(){0.0pt}1{N-1}{K-2}-1= \frac{K(K-1)}{N(N-K+1)}\genfrac(){0.0pt}1{N}{K}\le \genfrac(){0.0pt}1{N}{K}. \end{aligned}$$

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Ayad, L.A.K., Barton, C., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P. (2018). Longest Common Prefixes with k-Errors and Applications. In: Gagie, T., Moffat, A., Navarro, G., Cuadros-Vargas, E. (eds) String Processing and Information Retrieval. SPIRE 2018. Lecture Notes in Computer Science(), vol 11147. Springer, Cham. https://doi.org/10.1007/978-3-030-00479-8_3

Download citation

DOI: https://doi.org/10.1007/978-3-030-00479-8_3
Published: 14 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00478-1
Online ISBN: 978-3-030-00479-8
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Longest Common Prefixes with k-Errors and Applications

Abstract

Access this chapter

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

A K-Combinations Generation Process

A K-Combinations Generation Process

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation