Skip to main content

Longest Common Prefixes with k-Errors and Applications

  • Conference paper
  • First Online:
String Processing and Information Retrieval (SPIRE 2018)

Abstract

Although real-world text datasets, such as DNA sequences, are far from being uniformly random, string searching average-case algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length n that occurs elsewhere in the string with k-errors. This problem has already been studied under the Hamming distance model. Our first result is an improvement upon the state-of-the-art average-case time complexity for non-constant k and using only linear space under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in \(\mathcal {O}(n\frac{(c\log n)^k}{k!})\) time on average, where \(c>1\) is a constant, using \(\mathcal {O}(n)\) space. Finally, we show that our technique is applicable to several algorithmic problems found in computational biology and elsewhere. The importance of our technique lies on the fact that it is the first one achieving this bound for non-constant k and using \(\mathcal {O}(n)\) space.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    The reason for imposing “\(\alpha >4\)” instead of “\(\alpha >3\)” becomes clear in Sect. 3.3.

References

  1. Abboud, A., Williams, R., Yu, H.: More applications of the polynomial method to algorithm design. In: SODA, SODA 2015, pp. 218–230. Society for Industrial and Applied Mathematics (2015)

    Google Scholar 

  2. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discret. Algorithms 2(1), 53–86 (2004)

    Article  MathSciNet  Google Scholar 

  3. Alamro, H., Ayad, L.A.K., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P.: Longest common prefixes with k-mismatches and applications. In: Tjoa, A.M., Bellatreche, L., Biffl, S., van Leeuwen, J., Wiedermann, J. (eds.) SOFSEM 2018. LNCS, vol. 10706, pp. 636–649. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73117-9_45

    Chapter  Google Scholar 

  4. Alzamel, M., et al.: Efficient computation of sequence mappability. In: Gagie, T., et al. (eds.) SPIRE 2018. LNCS, vol. 11147, pp. 12–26. Springer, Cham (2018)

    Google Scholar 

  5. Alzamel, M., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P., Radoszewski, J., Sung, W.-K.: Faster algorithms for 1-mappability of a sequence. In: Gao, X., Du, H., Han, M. (eds.) COCOA 2017. LNCS, vol. 10628, pp. 109–121. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-71147-8_8

    Chapter  Google Scholar 

  6. Apostolico, A., Guerra, C., Landau, G.M., Pizzi, C.: Sequence similarity measures based on bounded hamming distance. Theor. Comput. Sci. 638, 76–90 (2016). Pattern Matching, Text Data Structures and Compression

    Article  MathSciNet  Google Scholar 

  7. Apostolico, A., Guerra, C., Pizzi, C.: Alignment free sequence similarity with bounded hamming distance. In: DCC, pp. 183–192. IEEE (2014)

    Google Scholar 

  8. Bender, M.A., Farach-Colton, M.: The LCA problem revisited. In: Gonnet, G.H., Viola, A. (eds.) LATIN 2000. LNCS, vol. 1776, pp. 88–94. Springer, Heidelberg (2000). https://doi.org/10.1007/10719839_9

    Chapter  Google Scholar 

  9. Bollobás, B., Letzter, S.: Longest common extension. Eur. J. Comb. 68, 242–248 (2018)

    Article  MathSciNet  Google Scholar 

  10. Charalampopoulos, P., et al.: Linear-time algorithm for long LCF with \(k\) mismatches. In: CPM. LIPIcs, vol. 105, pp. 23:1–23:16. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik (2018)

    Google Scholar 

  11. Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: STOC, STOC 2004, pp. 91–100. ACM (2004)

    Google Scholar 

  12. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009)

    MATH  Google Scholar 

  13. Derrien, T., et al.: Fast computation and applications of genome mappability. PLoS ONE 7(1), e30377 (2012)

    Article  Google Scholar 

  14. Eades, P., McKay, B.D.: An algorithm for generating subsets of fixed size with a strong minimal change property. Inf. Process. Lett. 19(3), 131–133 (1984)

    Article  MathSciNet  Google Scholar 

  15. Farach, M.: Optimal suffix tree construction with large alphabets. In: FOCS, pp. 137–143. IEEE Computer Society (1997)

    Google Scholar 

  16. Faro, S., Lecroq, T.: The exact online string matching problem: a review of the most recent results. ACM Comput. Surv 45(2), 13:1–13:42 (2013)

    Article  Google Scholar 

  17. Fischer, J.: Inducing the LCP-array. In: Dehne, F., Iacono, J., Sack, J.-R. (eds.) WADS 2011. LNCS, vol. 6844, pp. 374–385. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22300-6_32

    Chapter  Google Scholar 

  18. Flouri, T., Giaquinta, E., Kobert, K., Ukkonen, E.: Longest common substrings with \(k\) mismatches. Inf. Process. Lett. 115(6–8), 643–647 (2015)

    Article  MathSciNet  Google Scholar 

  19. Grabowski, S.: A note on the longest common substring with \(k\)-mismatches problem. Inf. Process. Lett. 115(6–8), 640–642 (2015)

    Article  MathSciNet  Google Scholar 

  20. Horwege, S., et al.: Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches. Nucleic Acids Res. 42(Webserver-Issue), 7–11 (2014)

    Google Scholar 

  21. Karlin, S., Ghandour, G., Ost, F., T, S., Korn, L.J.: New approaches for computer analysis of nucleic acid sequences. Proc. Natl. Acad. Sci. USA 80, 5660–5664 (1983)

    Article  Google Scholar 

  22. Kociumaka, T., Radoszewski, J., Starikovskaya, T.A.: Longest common substring with approximately \(k\) mismatches. CoRR, abs/1712.08573 (2017)

    Google Scholar 

  23. Kolpakov, R., Bana, G., Kucherov, G.: MREPS: efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Res. 31(13), 3672–3678 (2003)

    Article  Google Scholar 

  24. Kucherov, G., Tsur, D.: Improved filters for the approximate suffix-prefix overlap problem. In: Moura, E., Crochemore, M. (eds.) SPIRE 2014. LNCS, vol. 8799, pp. 139–148. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11918-2_14

    Chapter  Google Scholar 

  25. Leimeister, C., Morgenstern, B.: Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics 30(14), 2000–2008 (2014)

    Article  Google Scholar 

  26. Liang, K.-H.: Bioinformatics for Biomedical Science and Clinical Applications. Woodhead Publishing Series in Biomedicine. Woodhead Publishing (2013)

    Google Scholar 

  27. Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

    Article  MathSciNet  Google Scholar 

  28. Manzini, G.: Longest common prefix with mismatches. In: Iliopoulos, C., Puglisi, S., Yilmaz, E. (eds.) SPIRE 2015. LNCS, vol. 9309, pp. 299–310. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23826-5_29

    Chapter  Google Scholar 

  29. Navarro, G., Baeza-Yates, R.A.: A hybrid indexing method for approximate string matching. J. Discret. Algorithms 1(1), 21–49 (2000)

    MathSciNet  Google Scholar 

  30. Nong, G., Zhang, S., Chan, W.H.: Linear suffix array construction by almost pure induced-sorting. In: DCC, pp. 193–202. IEEE (2009)

    Google Scholar 

  31. Pizzi, C.: Missmax: alignment-free sequence comparison with mismatches through filtering and heuristics. Algorithms Mol. Biol. 11(1), 6 (2016)

    Article  Google Scholar 

  32. Rasmussen, K.R., Stoye, J., Myers, E.W.: Efficient \(q\)-gram filters for finding all epsilon-matches over a given length. J. Comput. Biol. 13(2), 296–308 (2006)

    Article  MathSciNet  Google Scholar 

  33. Smit, A.F.: Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev. 9(6), 657–663 (1999)

    Article  Google Scholar 

  34. Thankachan, S.V., Aluru, C., Chockalingam, S.P., Aluru, S.: Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In: Raphael, B.J. (ed.) RECOMB 2018. LNCS, vol. 10812, pp. 211–224. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89929-9_14

    Chapter  Google Scholar 

  35. Thankachan, S.V., Apostolico, A., Aluru, S.: A provably efficient algorithm for the k-mismatch average common substring problem. J. Comput. Biol. 23(6), 472–482 (2016)

    Article  MathSciNet  Google Scholar 

  36. Thankachan, S.V., Chockalingam, S.P., Liu, Y., Apostolico, A., Aluru, S.: ALFRED: a practical method for alignment-free distance computation. J. Comput. Biol., 23(6), 452–460 (2016)

    Article  MathSciNet  Google Scholar 

  37. Ulitsky, I., Burstein, D., Tuller, T., Chor, B.: The average common substring approach to phylogenomic reconstruction. J. Comput. Biol. 13(2), 336–350 (2006)

    Article  MathSciNet  Google Scholar 

  38. Välimäki, N., Ladra, S., Mäkinen, V.: Approximate all-pairs suffix/prefix overlaps. Inf. Comput. 213, 49–58 (2012)

    Article  MathSciNet  Google Scholar 

  39. Willard, D.E.: Log-logarithmic worst-case range queries are possible in space theta(n). Inf. Process. Lett. 17(2), 81–84 (1983)

    Article  Google Scholar 

Download references

Acknowledgements

We warmly thank Cyril Nicaud (Université Paris-Est) for useful discussions.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Panagiotis Charalampopoulos .

Editor information

Editors and Affiliations

A K-Combinations Generation Process

A K-Combinations Generation Process

We generate all combinations of size K of the set [N] using the following folklore algorithm. We build a tree in a recursive way. Each node v has a label \(\ell (v) \in [N]\), apart from the root which is labeled with 0. We say that a node v represents the set \(r(v)=\{\ell (v)\} \cup \{\ell (u)|u\text { is an ancestor of }v\}\setminus \{0\}\). By construction, no two nodes will represent the same set. At the end of the process each of the \(\mathcal {O}(\genfrac(){0.0pt}1{N}{K})\) leaves will represent a distinct K-combination. The procedure works as follows:

  1. 1.

    It takes as input a node v with satellite data \(\ell (v)\) and |r(v)|.

  2. 2.

    It creates \(N-\ell (v)-K+|r(v)|\) child nodes of v, labeled \(\ell (v)+1\) to \(N-K+|r(v)|+1\).

  3. 3.

    For each newly created node u, if \(|r(u)|=|r(v)|+1<K\), the procedure recursively calls itself.

Initiating this procedure with an input node u, with \(\ell (u)=0\) and \(|r(u)|=0\), all K-combinations are generated in time \(\mathcal {O}(\genfrac(){0.0pt}1{N}{K})\) if \(K\le N/2\) as shown below. Note that each node of the tree is associated with a unique subset of [N] and neighbouring nodes only differ on whether they contain the largest element of their union.

To upper bound the time complexity, it suffices to bound the number of nodes in the tree with a single child, since the rest internal nodes are trivially upper bounded by the number of leaves. A node with label \(N-i\) will have a single child only if it is the \((K-i)\)th element added in a combination. This can happen \(\mathcal {O}(\genfrac(){0.0pt}1{N-i-1}{K-i-1})\) times. Given our assumption that \(K\le N/2\) and using Pascal’s identity \(\genfrac(){0.0pt}1{m}{j}=\genfrac(){0.0pt}1{m-1}{j-1}+\genfrac(){0.0pt}1{m-1}{j}\), we bound the number of nodes with a single child as follows:

$$\begin{aligned}&\sum _{i=1}^{K-1}{\genfrac(){0.0pt}1{N-i-1}{K-i-1}}= 1+\sum _{i=1}^{K-2}{\genfrac(){0.0pt}1{N-i}{K-i-1}}-\sum _{i=1}^{K-2}{\genfrac(){0.0pt}1{N-i-1}{K-i-2}}\\&=1+\sum _{i=1}^{K-2}{\genfrac(){0.0pt}1{N-i}{K-i-1}}-\sum _{i=2}^{K-1}{\genfrac(){0.0pt}1{N-i}{K-i-1}}= 1+\genfrac(){0.0pt}1{N-1}{K-2}-1= \frac{K(K-1)}{N(N-K+1)}\genfrac(){0.0pt}1{N}{K}\le \genfrac(){0.0pt}1{N}{K}. \end{aligned}$$

Rights and permissions

Reprints and permissions

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Ayad, L.A.K., Barton, C., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P. (2018). Longest Common Prefixes with k-Errors and Applications. In: Gagie, T., Moffat, A., Navarro, G., Cuadros-Vargas, E. (eds) String Processing and Information Retrieval. SPIRE 2018. Lecture Notes in Computer Science(), vol 11147. Springer, Cham. https://doi.org/10.1007/978-3-030-00479-8_3

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-00479-8_3

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00478-1

  • Online ISBN: 978-3-030-00479-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics