Indexing Highly Repetitive Collections

Navarro, Gonzalo

doi:10.1007/978-3-642-35926-2_29

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7643)

Included in the following conference series:

International Workshop on Combinatorial Algorithms

Abstract

The need to index and search huge highly repetitive sequence collections is rapidly arising in various fields, including computational biology, software repositories, versioned collections, and others. In this short survey we briefly describe the progress made along three research lines to address the problem: compressed suffix arrays, grammar compressed indexes, and Lempel-Ziv compressed indexes.
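
As a rough illustration of why these three lines of work suit repetitive data, the sketch below counts two quantities that such indexes are commonly built around: the number of phrases in a greedy Lempel-Ziv (LZ77-style) parse and the number of runs in the Burrows-Wheeler transform, on which run-length compressed suffix arrays rely. It is a toy, quadratic-time sketch and not any method from the survey; the function names, the `"\0"` sentinel, and the synthetic "collection" of mutated copies are assumptions made only for this example.

```python
import random


def lz77_phrase_count(text: str) -> int:
    """Count phrases of a greedy LZ77-style parse.

    Each phrase is the longest prefix of the remaining text that also
    occurs starting at an earlier position (overlaps allowed), followed
    by one fresh character when any text remains.
    """
    i, phrases, n = 0, 0, len(text)
    while i < n:
        length = 0
        # Extend while text[i:i+length+1] has an occurrence starting before i.
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += length + 1
        phrases += 1
    return phrases


def bwt_runs(text: str) -> int:
    """Count maximal runs of equal symbols in the BWT of text.

    Uses a naive suffix sort; "\\0" is an assumed sentinel that must not
    occur in text.
    """
    s = text + "\0"
    suffix_array = sorted(range(len(s)), key=lambda k: s[k:])
    bwt = [s[k - 1] for k in suffix_array]
    return 1 + sum(1 for a, b in zip(bwt, bwt[1:]) if a != b)


def make_collections(base_len: int = 100, copies: int = 20, seed: int = 0):
    """Build a hypothetical repetitive collection and a random text of equal length."""
    rng = random.Random(seed)
    base = [rng.choice("ACGT") for _ in range(base_len)]
    parts = []
    for _ in range(copies):
        variant = base[:]
        variant[rng.randrange(base_len)] = rng.choice("ACGT")  # one point mutation
        parts.append("".join(variant))
    repetitive = "".join(parts)
    unrelated = "".join(rng.choice("ACGT") for _ in range(len(repetitive)))
    return repetitive, unrelated


if __name__ == "__main__":
    repetitive, unrelated = make_collections()
    for name, text in (("repetitive", repetitive), ("random", unrelated)):
        print(name, "LZ77 phrases:", lz77_phrase_count(text), "BWT runs:", bwt_runs(text))
```

On inputs like these, both counts should come out much smaller for the repetitive collection than for the random text of the same length, which is the kind of regularity that statistical (entropy-based) compression does not capture.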

Author information

Authors and Affiliations

Dept. of Computer Science, University of Chile, Chile
Gonzalo Navarro

Editor information

Editors and Affiliations

Kalasalingam University, Anand Nagar, 626 126, Krishanakoil, Tamil Nadu, India
S. Arumugam
Algorithms Research Group, Department of Computing & Software, McMaster University, L8S 4K1, Hamilton, ON, Canada
W. F. Smyth

About this paper

Cite this paper

Navarro, G. (2012). Indexing Highly Repetitive Collections. In: Arumugam, S., Smyth, W.F. (eds) Combinatorial Algorithms. IWOCA 2012. Lecture Notes in Computer Science, vol 7643. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35926-2_29

DOI: https://doi.org/10.1007/978-3-642-35926-2_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35925-5
Online ISBN: 978-3-642-35926-2
eBook Packages: Computer Science, Computer Science (R0)