Reducing Space Requirements for Disk Resident Suffix Arrays

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2009)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 5463))

Abstract

Suffix trees and suffix arrays are important data structures for string processing, providing efficient solutions for many applications involving pattern matching. Recent work by Sinha et al. (SIGMOD 2008) addressed the problem of arranging a suffix array on disk so that querying is fast, and showed that combining a small trie with a suffix array-like blocked data structure allows queries to be answered many times faster than with alternative disk-based suffix trees. A drawback of their LOF-SA structure, common to all current disk resident suffix tree/array approaches, is that the data structure, though held on disk, is large relative to the text: the LOF-SA requires 13n bytes, including the underlying n byte text. In this paper we explore techniques for reducing the space required by the LOF-SA. Experiments show that these methods cut the data structure to nearly half its original size, with no impact on search times for strings large enough to necessitate on-disk structures.
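
For readers unfamiliar with the underlying structure, the following is a minimal in-memory sketch of how a plain suffix array supports pattern matching. It is not the LOF-SA, its trie, or its on-disk block layout; the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive construction is chosen for brevity rather than efficiency.

```python
def build_suffix_array(text: str) -> list[int]:
    """Naive construction: sort suffix start positions by the suffix text.

    Illustrative only; practical builders use far more efficient algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return the sorted start positions of pattern, using two binary searches."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "ana"))
```

Each array entry is one integer per text position, which gives a sense of why an index over an n byte text can occupy several times n bytes before any auxiliary information is added.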

This work was supported by the Australian Research Council.

References

  1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2(1), 53–86 (2004)

  2. Anh, V.N., Moffat, A.: Inverted index compression using word-aligned binary codes. Information Retrieval 8(1), 151–166 (2005)

  3. Apostolico, A.: The myriad virtues of subword trees. In: Apostolico, A., Galil, Z. (eds.) Combinatorial Algorithms on Words. NATO ASI Series F12, pp. 85–96. Springer, Berlin (1985)

  4. Baeza-Yates, R.A., Barbosa, E.F., Ziviani, N.: Hierarchies of indices for text searching. Information Systems 21(6), 497–514 (1996)

  5. Brisaboa, N.R., Fariña, A., Navarro, G., Esteller, M.F.: (S,C)-dense coding: An optimized compression code for natural language text databases. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds.) SPIRE 2003. LNCS, vol. 2857, pp. 122–136. Springer, Heidelberg (2003)

  6. Cheung, C.-F., Xu Yu, J., Lu, H.: Constructing suffix tree for gigabyte sequences with megabyte memory. IEEE Transactions on Knowledge and Data Engineering 17(1), 90–105 (2005)

  7. Crauser, A., Ferragina, P.: A theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica 32, 1–35 (2002)

  8. Culpepper, J.S., Moffat, A.: Enhanced byte codes with restricted prefix properties. In: Consens, M.P., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 1–12. Springer, Heidelberg (2005)

  9. Dementiev, R., Kärkkäinen, J., Mehnert, J., Sanders, P.: Better external memory suffix array construction. ACM Journal of Experimental Algorithmics 12(3.4), 1–24 (2008)

  10. Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. Journal of the ACM 47(6), 987–1011 (2000)

  11. González, R., Navarro, G.: Compressed text indexes with fast locate. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 216–227. Springer, Heidelberg (2007)

  12. Gusfield, D.: Algorithms on strings, trees, and sequences: Computer science and computational biology. Cambridge University Press, Cambridge (1997)

  13. Kasai, T., Lee, G.H., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001)

  14. Larsson, J., Moffat, A.: Off-line dictionary-based compression. Proceedings of the IEEE 88(11), 1722–1732 (2000)

  15. Manber, U., Myers, G.W.: Suffix arrays: a new method for on-line string searches. SIAM Journal on Computing 22(5), 935–948 (1993)

  16. Manzini, G.: Two space saving tricks for linear time LCP array computation. In: Hagerup, T., Katajainen, J. (eds.) SWAT 2004. LNCS, vol. 3111, pp. 372–383. Springer, Heidelberg (2004)

  17. McCreight, E.M.: A space-economical suffix tree construction algorithm. Journal of the ACM 23(2), 262–272 (1976)

  18. Navarro, G., Mäkinen, V.: Compressed full text indexes. ACM Computing Surveys 39(1) (2007)

  19. Phoophakdee, B., Zaki, M.J.: Genome-scale disk-based suffix tree indexing. In: Chan, C.Y., Ooi, B.C., Zhou, A. (eds.) Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp. 833–844. ACM, New York (2007)

  20. Sinha, R., Puglisi, S.J., Moffat, A., Turpin, A.: Improving suffix array locality for fast pattern matching on disk. In: Lakshmanan, L.V.S., Ng, R.T., Shasha, D. (eds.) Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 661–672. ACM, New York (2008)

  21. Smyth, B.: Computing Patterns in Strings. Pearson Addison-Wesley, Essex (2003)

  22. Tian, Y., Tata, S., Hankins, R.A., Patel, J.M.: Practical methods for constructing suffix trees. The VLDB Journal 14(3), 281–299 (2005)

  23. Ukkonen, E.: Online construction of suffix trees. Algorithmica 14(3), 249–260 (1995)

  24. Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pp. 1–11. IEEE Computer Society, Washington (1973)

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Moffat, A., Puglisi, S.J., Sinha, R. (2009). Reducing Space Requirements for Disk Resident Suffix Arrays . In: Zhou, X., Yokota, H., Deng, K., Liu, Q. (eds) Database Systems for Advanced Applications. DASFAA 2009. Lecture Notes in Computer Science, vol 5463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00887-0_63

  • DOI: https://doi.org/10.1007/978-3-642-00887-0_63

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00886-3

  • Online ISBN: 978-3-642-00887-0

  • eBook Packages: Computer Science, Computer Science (R0)
