Towards a Definitive Measure of Repetitiveness

Kociumaka, Tomasz; Navarro, Gonzalo; Prezza, Nicola

doi:10.1007/978-3-030-61792-9_17

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 12118))

Included in the following conference series:

Latin American Symposium on Theoretical Informatics

823 Accesses
19 Citations

Abstract

Unlike in statistical compression, where Shannon’s entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel–Ziv parse are frequently used to estimate repetitiveness. Recently, a more principled measure, the size \(\gamma \) of the smallest string attractor, was introduced. The measure \(\gamma \) lower bounds all the previous relevant ones (including z), yet length-n strings can be represented and efficiently indexed within space \(O(\gamma \log \frac{n}{\gamma })\), which also upper bounds most measures (including z). While \(\gamma \) is certainly a better measure of repetitiveness than z, it is NP-complete to compute, and no \(o(\gamma \log n)\)-space representation of strings is known. In this paper, we study a smaller measure, \(\delta \le \gamma \), which can be computed in linear time. We show that \(\delta \) better captures the compressibility of repetitive strings. For every length n and every value \(\delta \ge 2\), we construct a string such that \(\gamma = \varOmega (\delta \log \frac{n}{\delta })\). Still, we show a representation of any string S in \(O(\delta \log \frac{n}{\delta })\) space that supports direct access to any character S[i] in time \(O(\log \frac{n}{\delta })\) and finds the occ occurrences of any pattern \(P[1{.\,.}m]\) in time \(O(m\log n + occ\log ^\varepsilon n)\) for any constant \(\varepsilon >0\). Further, we prove that no \(o(\delta \log n)\)-space representation exists: for every length n and every value \(2\le \delta \le n^{1-\varepsilon }\), we exhibit a string family whose elements can only be encoded in \(\varOmega (\delta \log \frac{n}{\delta })\) space. We complete our characterization of \(\delta \) by showing that, although \(\gamma \), z, and other repetitiveness measures are always \(O(\delta \log \frac{n}{\delta })\), for strings of any length n, the smallest context-free grammar can be of size \(\varOmega (\delta \log ^2 n/\log \log n)\). No such separation is known for \(\gamma \).

Part of this work was carried out during the Dagstuhl Seminar 19241, “25 Years of the Burrows–Wheeler Transform”.

T. Kociumaka—Supported by ISF grants no. 1278/16, 824/17, and 1926/19, a BSF grant no. 2018364, and an ERC grant MPM (no. 683064) under the EU’s Horizon 2020 Research and Innovation Programme.

G. Navarro—Supported in part by Fondecyt grant 1-170048, Chile; Millennium Institute for Foundational Research on Data (IMFD), Chile.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
Throughout the paper, the size of data structures is measured in machine words.
2.
The most recent index [13] locates patterns in \(O(m + (occ+1)\log ^\epsilon n)\) time and \(O(\gamma \log \frac{n}{\gamma })\) space (being thus faster but still using more space).
3.
If not, we simply pad S with spurious symbols at the end; whole spurious blocks are not represented. The extra space incurred is only O(rh) for a block tree of height h. The actual construction [5] uses instead blocks of sizes \(\lfloor n/s\rfloor \) and \(\lceil n/s \rfloor \).

References

Belazzougui, D., Cording, P.H., Puglisi, S.J., Tabei, Y.: Access, rank, and select in grammar-compressed strings. In: Bansal, N., Finocchi, I. (eds.) ESA 2015. LNCS, vol. 9294, pp. 142–154. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48350-3_13
Chapter Google Scholar
Belazzougui, D., Cunial, F.: Fast label extraction in the CDAWG. In: Fici, G., Sciortino, M., Venturini, R. (eds.) SPIRE 2017. LNCS, vol. 10508, pp. 161–175. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67428-5_14
Chapter MATH Google Scholar
Belazzougui, D., Cunial, F.: Representing the suffix tree with the CDAWG. In: Proceedings of 28th Annual Symposium on Combinatorial Pattern Matching, CPM, pp. 7:1–7:13 (2017). https://doi.org/10.4230/LIPIcs.CPM.2017.7
Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M.: Composite repetition-aware data structures. In: Cicalese, F., Porat, E., Vaccaro, U. (eds.) CPM 2015. LNCS, vol. 9133, pp. 26–39. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19929-0_3
Chapter Google Scholar
Belazzougui, D., et al.: Queries on LZ-bounded encodings. In: Proceedings of the 2015 Data Compression Conference, DCC, pp. 83–92 (2015). https://doi.org/10.1109/DCC.2015.69
Bille, P., Ettienne, M.B., Gørtz, I.L., Vildhøj, H.W.: Time-space trade-offs for Lempel-Ziv compressed indexing. Theor. Comput. Sci. 713, 66–77 (2018). https://doi.org/10.1016/j.tcs.2017.12.021
Article MathSciNet MATH Google Scholar
Bille, P., Gørtz, I.L., Sach, B., Vildhøj, H.W.: Time-space trade-offs for longest common extensions. J. Discrete Algorithms 25, 42–50 (2014). https://doi.org/10.1016/j.jda.2013.06.003
Article MathSciNet MATH Google Scholar
Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings and trees. SIAM J. Comput. 44(3), 513–539 (2015). https://doi.org/10.1137/130936889
Article MathSciNet MATH Google Scholar
Blumer, A., Blumer, J., Haussler, D., McConnell, R.M., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM 34(3), 578–595 (1987). https://doi.org/10.1145/28869.28873
Article MathSciNet MATH Google Scholar
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation (1994). https://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.pdf
Charikar, M., et al.: The smallest grammar problem. IEEE Trans. Inf. Theory 51(7), 2554–2576 (2005). https://doi.org/10.1109/TIT.2005.850116
Article MathSciNet MATH Google Scholar
Christiansen, A.R., Ettienne, M.B.: Compressed indexing with signature grammars. In: Bender, M.A., Farach-Colton, M., Mosteiro, M.A. (eds.) LATIN 2018. LNCS, vol. 10807, pp. 331–345. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-77404-6_25
Chapter Google Scholar
Christiansen, A.R., Ettienne, M.B., Kociumaka, T., Navarro, G., Prezza, N.: Optimal-time dictionary-compressed indexes. arXiv e-prints p. 1811.12779 (2019). https://arxiv.org/abs/1811.12779
Claude, F., Navarro, G.: Self-indexed grammar-based compression. Fundam. Inform. 111(3), 313–337 (2011). https://doi.org/10.3233/FI-2011-565
Article MathSciNet MATH Google Scholar
Claude, F., Navarro, G.: Improved grammar-based compressed indexes. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 180–192. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34109-0_19
Chapter Google Scholar
Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: A faster grammar-based self-index. In: Dediu, A.-H., Martín-Vide, C. (eds.) LATA 2012. LNCS, vol. 7183, pp. 240–251. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28332-1_21
Chapter Google Scholar
Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Pardo, A., Viola, A. (eds.) LATIN 2014. LNCS, vol. 8392, pp. 731–742. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54423-1_63
Chapter Google Scholar
Gagie, T., Navarro, G., Prezza, N.: On the approximation ratio of Lempel-Ziv parsing. In: Bender, M.A., Farach-Colton, M., Mosteiro, M.A. (eds.) LATIN 2018. LNCS, vol. 10807, pp. 490–503. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-77404-6_36
Chapter Google Scholar
Gagie, T., Navarro, G., Prezza, N.: Fully-functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67(1), 1–54 (2020). https://doi.org/10.1145/3375890
Article MathSciNet MATH Google Scholar
Gallant, J.K.: String Compression Algorithms. Ph.D. thesis, Princeton Univ. (1982)
Google Scholar
Jez, A.: A really simple approximation of smallest grammar. Theor. Comput. Sci. 616, 141–150 (2016). https://doi.org/10.1016/j.tcs.2015.12.032
Article MathSciNet MATH Google Scholar
Karp, R.M., Rabin, M.O.: Efficient randomized pattern-matching algorithms. IBM J. Res. Dev. 31(2), 249–260 (1987). https://doi.org/10.1147/rd.312.0249
Article MathSciNet MATH Google Scholar
Kempa, D., Kociumaka, T.: Resolution of the Burrows-Wheeler transform conjecture. arXiv e-prints p. 1910.10631 (2019). https://arxiv.org/abs/1910.10631
Kempa, D., Prezza, N.: At the roots of dictionary compression: String attractors. In: Proceedings of the 50th Annual ACM Symposium on the Theory of Computing, STOC, pp. 827–840 (2018). https://doi.org/10.1145/3188745.3188814
Kida, T., Matsumoto, T., Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Collage system: a unifying framework for compressed pattern matching. Theor. Comput. Sci. 298(1), 253–272 (2003). https://doi.org/10.1016/S0304-3975(02)00426-7
Article MathSciNet MATH Google Scholar
Kieffer, J.C., Yang, E.: Grammar-based codes: a new class of universal lossless source codes. IEEE Trans. Inf. Theory 46(3), 737–754 (2000). https://doi.org/10.1109/18.841160
Article MathSciNet MATH Google Scholar
Kolmogorov, A.N.: Three approaches to the quantitative definition of information. Int. J. Comput. Math. 2(1–4), 157–168 (1968). https://doi.org/10.1080/00207166808803030
Article MathSciNet MATH Google Scholar
Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. 483, 115–133 (2013). https://doi.org/10.1016/j.tcs.2012.02.006
Article MathSciNet MATH Google Scholar
Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22(1), 75–81 (1976). https://doi.org/10.1109/TIT.1976.1055501
Article MathSciNet MATH Google Scholar
Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010). https://doi.org/10.1089/cmb.2009.0169
Article MathSciNet Google Scholar
Navarro, G., Ochoa, C., Prezza, N.: On the approximation ratio of ordered parsings. arXiv e-prints p. 1803.09517 (2019). https://arxiv.org/abs/1803.09517
Navarro, G.: Compact Data Structures - A Practical Approach. Cambridge University Press, New York (2016). https://doi.org/10.1017/cbo9781316588284
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Comput. Surv. 39, 1 (2007). https://doi.org/10.1145/1216370.1216372
Article MATH Google Scholar
Navarro, G., Prezza, N.: Universal compressed text indexing. Theor. Comput. Sci. 762, 41–50 (2019). https://doi.org/10.1016/j.tcs.2018.09.007
Article MathSciNet MATH Google Scholar
Nishimoto, T., Tomohiro, I., Inenaga, S., Bannai, H., Takeda, M.: Fully dynamic data structure for LCE queries in compressed space. In: Proceedings of the 41st International Symposium on Mathematical Foundations of Computer Science, MFCS, pp. 72:1–72:15 (2016). https://doi.org/10.4230/LIPIcs.MFCS.2016.72
Prezza, N.: Optimal rank and select queries on dictionary-compressed text. In: Proceedings of the 30th Annual Symposium on Combinatorial Pattern Matching, CPM, pp. 4:1–4:12 (2019). https://doi.org/10.4230/LIPIcs.CPM.2019.4
Raskhodnikova, S., Ron, D., Rubinfeld, R., Smith, A.D.: Sublinear algorithms for approximating string compressibility. Algorithmica 65(3), 685–709 (2013). https://doi.org/10.1007/s00453-012-9618-6
Article MathSciNet MATH Google Scholar
Rodeh, M., Pratt, V.R., Even, S.: Linear algorithm for data compression via string matching. J. ACM 28(1), 16–24 (1981). https://doi.org/10.1145/322234.322237
Article MathSciNet MATH Google Scholar
Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci. 302(1–3), 211–222 (2003). https://doi.org/10.1016/S0304-3975(02)00777-6
Article MathSciNet MATH Google Scholar
Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 398–403 (1948). https://doi.org/10.1002/j.1538-7305.1948.tb01338.x
Article MathSciNet Google Scholar
Stephens, Z.D., et al.: Big data: astronomical or genomical? PLOS Biol. 13(7), e1002195 (2015). https://doi.org/10.1371/journal.pbio.1002195
Article Google Scholar
Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. J. ACM 29(4), 928–951 (1982). https://doi.org/10.1145/322344.322346
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
Tomasz Kociumaka
Millennium Institute for Foundational Research on Data (IMFD), Santiago, Chile
Gonzalo Navarro
Department of Computer Science, University of Chile, Santiago, Chile
Gonzalo Navarro
Department of Business and Management, Luiss Guido Carli, Rome, Italy
Nicola Prezza

Authors

Tomasz Kociumaka
View author publications
You can also search for this author in PubMed Google Scholar
Gonzalo Navarro
View author publications
You can also search for this author in PubMed Google Scholar
Nicola Prezza
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Nicola Prezza .

Editor information

Editors and Affiliations

University of São Paulo, São Paulo, Brazil
Yoshiharu Kohayakawa
University of Campinas, Campinas, Brazil
Flávio Keidi Miyazawa

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Kociumaka, T., Navarro, G., Prezza, N. (2020). Towards a Definitive Measure of Repetitiveness. In: Kohayakawa, Y., Miyazawa, F.K. (eds) LATIN 2020: Theoretical Informatics. LATIN 2021. Lecture Notes in Computer Science(), vol 12118. Springer, Cham. https://doi.org/10.1007/978-3-030-61792-9_17

Download citation

DOI: https://doi.org/10.1007/978-3-030-61792-9_17
Published: 03 December 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61791-2
Online ISBN: 978-3-030-61792-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics