Skip to main content

Towards a Definitive Measure of Repetitiveness

  • Conference paper
  • First Online:
LATIN 2020: Theoretical Informatics (LATIN 2021)

Abstract

Unlike in statistical compression, where Shannon’s entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel–Ziv parse are frequently used to estimate repetitiveness. Recently, a more principled measure, the size \(\gamma \) of the smallest string attractor, was introduced. The measure \(\gamma \) lower bounds all the previous relevant ones (including z), yet length-n strings can be represented and efficiently indexed within space \(O(\gamma \log \frac{n}{\gamma })\), which also upper bounds most measures (including z). While \(\gamma \) is certainly a better measure of repetitiveness than z, it is NP-complete to compute, and no \(o(\gamma \log n)\)-space representation of strings is known. In this paper, we study a smaller measure, \(\delta \le \gamma \), which can be computed in linear time. We show that \(\delta \) better captures the compressibility of repetitive strings. For every length n and every value \(\delta \ge 2\), we construct a string such that \(\gamma = \varOmega (\delta \log \frac{n}{\delta })\). Still, we show a representation of any string S in \(O(\delta \log \frac{n}{\delta })\) space that supports direct access to any character S[i] in time \(O(\log \frac{n}{\delta })\) and finds the occ occurrences of any pattern \(P[1{.\,.}m]\) in time \(O(m\log n + occ\log ^\varepsilon n)\) for any constant \(\varepsilon >0\). Further, we prove that no \(o(\delta \log n)\)-space representation exists: for every length n and every value \(2\le \delta \le n^{1-\varepsilon }\), we exhibit a string family whose elements can only be encoded in \(\varOmega (\delta \log \frac{n}{\delta })\) space. We complete our characterization of \(\delta \) by showing that, although \(\gamma \), z, and other repetitiveness measures are always \(O(\delta \log \frac{n}{\delta })\), for strings of any length n, the smallest context-free grammar can be of size \(\varOmega (\delta \log ^2 n/\log \log n)\). No such separation is known for \(\gamma \).

Part of this work was carried out during the Dagstuhl Seminar 19241, “25 Years of the Burrows–Wheeler Transform”.

T. Kociumaka—Supported by ISF grants no. 1278/16, 824/17, and 1926/19, a BSF grant no. 2018364, and an ERC grant MPM (no. 683064) under the EU’s Horizon 2020 Research and Innovation Programme.

G. Navarro—Supported in part by Fondecyt grant 1-170048, Chile; Millennium Institute for Foundational Research on Data (IMFD), Chile.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    Throughout the paper, the size of data structures is measured in machine words.

  2. 2.

    The most recent index  [13] locates patterns in \(O(m + (occ+1)\log ^\epsilon n)\) time and \(O(\gamma \log \frac{n}{\gamma })\) space (being thus faster but still using more space).

  3. 3.

    If not, we simply pad S with spurious symbols at the end; whole spurious blocks are not represented. The extra space incurred is only O(rh) for a block tree of height h. The actual construction  [5] uses instead blocks of sizes \(\lfloor n/s\rfloor \) and \(\lceil n/s \rfloor \).

References

  1. Belazzougui, D., Cording, P.H., Puglisi, S.J., Tabei, Y.: Access, rank, and select in grammar-compressed strings. In: Bansal, N., Finocchi, I. (eds.) ESA 2015. LNCS, vol. 9294, pp. 142–154. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48350-3_13

    Chapter  Google Scholar 

  2. Belazzougui, D., Cunial, F.: Fast label extraction in the CDAWG. In: Fici, G., Sciortino, M., Venturini, R. (eds.) SPIRE 2017. LNCS, vol. 10508, pp. 161–175. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67428-5_14

    Chapter  MATH  Google Scholar 

  3. Belazzougui, D., Cunial, F.: Representing the suffix tree with the CDAWG. In: Proceedings of 28th Annual Symposium on Combinatorial Pattern Matching, CPM, pp. 7:1–7:13 (2017). https://doi.org/10.4230/LIPIcs.CPM.2017.7

  4. Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 26–39. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19929-0_3

    Chapter  Google Scholar 

  5. Belazzougui, D., et al.: Queries on LZ-bounded encodings. In: Proceedings of the 2015 Data Compression Conference, DCC, pp. 83–92 (2015). https://doi.org/10.1109/DCC.2015.69

  6. Bille, P., Ettienne, M.B., Gørtz, I.L., Vildhøj, H.W.: Time-space trade-offs for Lempel-Ziv compressed indexing. Theor. Comput. Sci. 713, 66–77 (2018). https://doi.org/10.1016/j.tcs.2017.12.021

    Article  MathSciNet  MATH  Google Scholar 

  7. Bille, P., Gørtz, I.L., Sach, B., Vildhøj, H.W.: Time-space trade-offs for longest common extensions. J. Discrete Algorithms 25, 42–50 (2014). https://doi.org/10.1016/j.jda.2013.06.003

    Article  MathSciNet  MATH  Google Scholar 

  8. Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings and trees. SIAM J. Comput. 44(3), 513–539 (2015). https://doi.org/10.1137/130936889

    Article  MathSciNet  MATH  Google Scholar 

  9. Blumer, A., Blumer, J., Haussler, D., McConnell, R.M., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM 34(3), 578–595 (1987). https://doi.org/10.1145/28869.28873

    Article  MathSciNet  MATH  Google Scholar 

  10. Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation (1994). https://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.pdf

  11. Charikar, M., et al.: The smallest grammar problem. IEEE Trans. Inf. Theory 51(7), 2554–2576 (2005). https://doi.org/10.1109/TIT.2005.850116

    Article  MathSciNet  MATH  Google Scholar 

  12. Christiansen, A.R., Ettienne, M.B.: Compressed indexing with signature grammars. In: Bender, M.A., Farach-Colton, M., Mosteiro, M.A. (eds.) LATIN 2018. LNCS, vol. 10807, pp. 331–345. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-77404-6_25

    Chapter  Google Scholar 

  13. Christiansen, A.R., Ettienne, M.B., Kociumaka, T., Navarro, G., Prezza, N.: Optimal-time dictionary-compressed indexes. arXiv e-prints p. 1811.12779 (2019). https://arxiv.org/abs/1811.12779

  14. Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fundam. Inform. 111(3), 313–337 (2011). https://doi.org/10.3233/FI-2011-565

    Article  MathSciNet  MATH  Google Scholar 

  15. Claude, F., Navarro, G.: Improved grammar-based compressed indexes. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 180–192. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34109-0_19

    Chapter  Google Scholar 

  16. Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: A faster grammar-based self-index. In: Dediu, A.-H., Martín-Vide, C. (eds.) LATA 2012. LNCS, vol. 7183, pp. 240–251. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28332-1_21

    Chapter  Google Scholar 

  17. Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Pardo, A., Viola, A. (eds.) LATIN 2014. LNCS, vol. 8392, pp. 731–742. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54423-1_63

    Chapter  Google Scholar 

  18. Gagie, T., Navarro, G., Prezza, N.: On the approximation ratio of Lempel-Ziv parsing. In: Bender, M.A., Farach-Colton, M., Mosteiro, M.A. (eds.) LATIN 2018. LNCS, vol. 10807, pp. 490–503. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-77404-6_36

    Chapter  Google Scholar 

  19. Gagie, T., Navarro, G., Prezza, N.: Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67(1), 1–54 (2020). https://doi.org/10.1145/3375890

    Article  MathSciNet  MATH  Google Scholar 

  20. Gallant, J.K.: String Compression Algorithms. Ph.D. thesis, Princeton Univ. (1982)

    Google Scholar 

  21. Jez, A.: A really simple approximation of smallest grammar. Theor. Comput. Sci. 616, 141–150 (2016). https://doi.org/10.1016/j.tcs.2015.12.032

    Article  MathSciNet  MATH  Google Scholar 

  22. Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987). https://doi.org/10.1147/rd.312.0249

    Article  MathSciNet  MATH  Google Scholar 

  23. Kempa, D., Kociumaka, T.: Resolution of the Burrows-Wheeler transform conjecture. arXiv e-prints p. 1910.10631 (2019). https://arxiv.org/abs/1910.10631

  24. Kempa, D., Prezza, N.: At the roots of dictionary compression: String attractors. In: Proceedings of the 50th Annual ACM Symposium on the Theory of Computing, STOC, pp. 827–840 (2018). https://doi.org/10.1145/3188745.3188814

  25. Kida, T., Matsumoto, T., Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Collage system: a unifying framework for compressed pattern matching. Theor. Comput. Sci. 298(1), 253–272 (2003). https://doi.org/10.1016/S0304-3975(02)00426-7

    Article  MathSciNet  MATH  Google Scholar 

  26. Kieffer, J.C., Yang, E.: Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. Inf. Theory 46(3), 737–754 (2000). https://doi.org/10.1109/18.841160

    Article  MathSciNet  MATH  Google Scholar 

  27. Kolmogorov, A.N.: Three approaches to the quantitative definition of information. Int. J. Comput. Math. 2(1–4), 157–168 (1968). https://doi.org/10.1080/00207166808803030

    Article  MathSciNet  MATH  Google Scholar 

  28. Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. 483, 115–133 (2013). https://doi.org/10.1016/j.tcs.2012.02.006

    Article  MathSciNet  MATH  Google Scholar 

  29. Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22(1), 75–81 (1976). https://doi.org/10.1109/TIT.1976.1055501

    Article  MathSciNet  MATH  Google Scholar 

  30. Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010). https://doi.org/10.1089/cmb.2009.0169

    Article  MathSciNet  Google Scholar 

  31. Navarro, G., Ochoa, C., Prezza, N.: On the approximation ratio of ordered parsings. arXiv e-prints p. 1803.09517 (2019). https://arxiv.org/abs/1803.09517

  32. Navarro, G.: Compact Data Structures - A Practical Approach. Cambridge University Press, New York (2016). https://doi.org/10.1017/cbo9781316588284

  33. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39, 1 (2007). https://doi.org/10.1145/1216370.1216372

    Article  MATH  Google Scholar 

  34. Navarro, G., Prezza, N.: Universal compressed text indexing. Theor. Comput. Sci. 762, 41–50 (2019). https://doi.org/10.1016/j.tcs.2018.09.007

    Article  MathSciNet  MATH  Google Scholar 

  35. Nishimoto, T., Tomohiro, I., Inenaga, S., Bannai, H., Takeda, M.: Fully dynamic data structure for LCE queries in compressed space. In: Proceedings of the 41st International Symposium on Mathematical Foundations of Computer Science, MFCS, pp. 72:1–72:15 (2016). https://doi.org/10.4230/LIPIcs.MFCS.2016.72

  36. Prezza, N.: Optimal rank and select queries on dictionary-compressed text. In: Proceedings of the 30th Annual Symposium on Combinatorial Pattern Matching, CPM, pp. 4:1–4:12 (2019). https://doi.org/10.4230/LIPIcs.CPM.2019.4

  37. Raskhodnikova, S., Ron, D., Rubinfeld, R., Smith, A.D.: Sublinear algorithms for approximating string compressibility. Algorithmica 65(3), 685–709 (2013). https://doi.org/10.1007/s00453-012-9618-6

    Article  MathSciNet  MATH  Google Scholar 

  38. Rodeh, M., Pratt, V.R., Even, S.: Linear algorithm for data compression via string matching. J. ACM 28(1), 16–24 (1981). https://doi.org/10.1145/322234.322237

    Article  MathSciNet  MATH  Google Scholar 

  39. Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302(1–3), 211–222 (2003). https://doi.org/10.1016/S0304-3975(02)00777-6

    Article  MathSciNet  MATH  Google Scholar 

  40. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 398–403 (1948). https://doi.org/10.1002/j.1538-7305.1948.tb01338.x

    Article  MathSciNet  Google Scholar 

  41. Stephens, Z.D., et al.: Big data: astronomical or genomical? PLOS Biol. 13(7), e1002195 (2015). https://doi.org/10.1371/journal.pbio.1002195

    Article  Google Scholar 

  42. Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. J. ACM 29(4), 928–951 (1982). https://doi.org/10.1145/322344.322346

    Article  MathSciNet  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Nicola Prezza .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Kociumaka, T., Navarro, G., Prezza, N. (2020). Towards a Definitive Measure of Repetitiveness. In: Kohayakawa, Y., Miyazawa, F.K. (eds) LATIN 2020: Theoretical Informatics. LATIN 2021. Lecture Notes in Computer Science(), vol 12118. Springer, Cham. https://doi.org/10.1007/978-3-030-61792-9_17

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-61792-9_17

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-61791-2

  • Online ISBN: 978-3-030-61792-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics