Skip to main content

Parallel External Memory Suffix Sorting

  • Conference paper
  • First Online:
Combinatorial Pattern Matching (CPM 2015)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 9133))

Included in the following conference series:

Abstract

Suffix sorting (or suffix array construction) is one of the most important tasks in string processing, with dozens of applications, particularly in text indexing and data compression. Some of these applications require the suffix array to be built for large inputs that greatly exceed the size of RAM and so external memory must be used. However, existing approaches for external memory suffix sorting either use debilitatingly large amounts of disk space, or become too slow when the size of the input data is more than a few times bigger than the size of RAM. In this paper we address the latter problem via a non-trivial parallelization of computation. In our experiments, the resulting algorithm is much faster than the best prior external memory algorithms while using very little disk space in addition to what is needed for the input and output. On the way to this result we provide the current fastest (parallel) internal memory algorithm for suffix sorting, which is usually around twice as fast as previous methods, while using around one quarter of the working space.

The work is partially supported by the Academy of Finland through grant 258308.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    http://www.1000genomes.org/.

  2. 2.

    http://dumps.wikimedia.org/.

  3. 3.

    http://www.kernel.org/.

  4. 4.

    The implementation is available at http://www.cs.helsinki.fi/group/pads/.

References

  1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2(1), 53–86 (2004)

    Article  MATH  MathSciNet  Google Scholar 

  2. Andersson, A., Hagerup, T., Håstad, J., Petersson, O.: Tight bounds for searching a sorted array of strings. SIAM J. Comput. 30(5), 1552–1578 (2000)

    Article  MATH  MathSciNet  Google Scholar 

  3. Apostolico, A., Iliopoulos, C.S., Landau, G.M., Schieber, B., Vishkin, U.: Parallel construction of a suffix tree with applications. Algorithmica 3, 347–365 (1988)

    Article  MATH  MathSciNet  Google Scholar 

  4. Apostolico, A., Lonardi, S.: Off-line compression by greedy textual substitution. Proc. IEEE 88(11), 1733–1744 (2000)

    Article  Google Scholar 

  5. Bingmann, T., Fischer, J., Osipov, V.: Inducing suffix and LCP arrays in external memory. In: Sanders, P., Zeh, N. (eds.) ALENEX 2013. pp. 88–102. SIAM (2013)

    Google Scholar 

  6. Blelloch, G.E., Shun, J.: A simple parallel cartesian tree algorithm and its application to suffix tree construction. In: Müller-Hannemann, M., Werneck, R.F.F. (eds.) ALENEX 2011, pp. 48–58. SIAM (2011)

    Google Scholar 

  7. Crauser, A., Ferragina, P.: A theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica 32(1), 1–35 (2002)

    Article  MATH  MathSciNet  Google Scholar 

  8. Dementiev, R., Kettner, L., Sanders, P.: STXXL: standard template library for XXL data sets. Softw. Pract. Exper. 38(6), 589–637 (2008)

    Article  Google Scholar 

  9. Deo, M., Keely, S.: Parallel suffix array and least common prefix for the GPU. In: Nicolau, A., Shen, X., Amarasinghe, S.P., Vuduc, R.W. (eds.) PPoPP 2013, pp. 197–206. ACM (2013)

    Google Scholar 

  10. Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. J. ACM 47(6), 987–1011 (2000)

    Article  MATH  MathSciNet  Google Scholar 

  11. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)

    Article  MathSciNet  Google Scholar 

  12. Ferragina, P., Gagie, T., Manzini, G.: Lightweight data indexing and compression in external memory. Algorithmica 63(3), 707–730 (2012)

    Article  MATH  MathSciNet  Google Scholar 

  13. Gonnet, G.H., Baeza-Yates, R.A., Snider, T.: New indices for text: pat trees and pat arrays. In: Frakes, W.B., Baeza-Yates, R. (eds.) Information Retrieval: Data Structures and Algorithms, pp. 66–82. Prentice-Hall, Englewood Cliffs (1992)

    Google Scholar 

  14. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)

    Article  MATH  MathSciNet  Google Scholar 

  15. Kärkkäinen, J., Kempa, D.: Engineering a lightweight external memory suffix array construction algorithm. In: Iliopoulos, C.S., Langiu, A. (eds.) ICABD 2014, CEUR Workshop Proceedings, vol. 1146, pp. 53–60 (2014). CEUR-WS.org

  16. Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Linear time Lempel-Ziv factorization: simple, fast, small. In: Fischer, J., Sanders, P. (eds.) CPM 2013. LNCS, vol. 7922, pp. 189–200. Springer, Heidelberg (2013)

    Chapter  Google Scholar 

  17. Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Lempel-Ziv parsing in external memory. In: Bilgin, A., Marcellin, M.W., Serra-Sagristà, J., Storer, J.A. (eds.) DCC 2014, pp. 153–162. IEEE (2014)

    Google Scholar 

  18. Kärkkäinen, J., Kempa, D., Puglisi, S.J.: String range matching. In: Kulikov, A.S., Kuznetsov, S.O., Pevzner, P. (eds.) CPM 2014. LNCS, vol. 8486, pp. 232–241. Springer, Heidelberg (2014)

    Google Scholar 

  19. Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM 53(6), 918–936 (2006)

    Article  MathSciNet  Google Scholar 

  20. Kulla, F., Sanders, P.: Scalable parallel suffix array construction. Parallel Comput. 33(9), 605–612 (2007)

    Article  Google Scholar 

  21. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.L.: Ultrafast and memory-efficient alignment of short dna sequences to the human genome. Genome Biol. 10(3), R25 (2009)

    Article  Google Scholar 

  22. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25(14), 1754–1760 (2009)

    Article  Google Scholar 

  23. Louza, F.A., Telles, G.P., Ciferri, C.D.D.A.: External memory generalized suffix and LCP arrays construction. In: Fischer, J., Sanders, P. (eds.) CPM 2013. LNCS, vol. 7922, pp. 201–210. Springer, Heidelberg (2013)

    Chapter  Google Scholar 

  24. Manber, U., Myers, G.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

    Article  MATH  MathSciNet  Google Scholar 

  25. Mori, Y.: libdivsufsort, a C library for suffix array construction. http://code.google.com/p/libdivsufsort/

  26. Nakamura, R., Inenaga, S., Bannai, H., Funamoto, T., Takeda, M., Shinohara, A.: Linear-time text compression by longest-first substitution. Algorithms 2(4), 1429–1448 (2009)

    Article  MathSciNet  Google Scholar 

  27. Nong, G., Chan, W.H., Zhang, S., Guan, X.F.: Suffix array construction in external memory using d-critical substrings. ACM Trans. Inf. Syst. 32(1), 1 (2014)

    Article  Google Scholar 

  28. Osipov, V.: Parallel suffix array construction for shared memory architectures. In: Calderón-Benavides, L., González-Caro, C., Chávez, E., Ziviani, N. (eds.) SPIRE 2012. LNCS, vol. 7608, pp. 379–384. Springer, Heidelberg (2012)

    Chapter  Google Scholar 

  29. Puglisi, S.J., Smyth, W.F., Turpin, A.: A taxonomy of suffix array construction algorithms. ACM Comput. Surv. 39(2), 31 (2007). Article 4

    Article  Google Scholar 

  30. Simpson, J.T., Durbin, R.: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22(3), 549–556 (2012)

    Article  Google Scholar 

  31. Tischler, G.: Faster average case low memory semi-external construction of the Burrows-Wheeler transform. In: Iliopoulos, C.S., Langiu, A. (eds.) ICABD 2014, CEUR Workshop Proceedings, vol. 1146, pp. 61–68 (2014). CEUR-WS.org

  32. Weiner, P.: Linear pattern matching algorithms. In: SWAT 1973, pp. 1–11. IEEE (1973)

    Google Scholar 

  33. Williams, H.E., Zobel, J.: Compressing integers for fast file access. Comput. J. 42(3), 193–201 (1999)

    Article  Google Scholar 

  34. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theor. 23(3), 337–343 (1977)

    Article  MATH  MathSciNet  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dominik Kempa .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Kärkkäinen, J., Kempa, D., Puglisi, S.J. (2015). Parallel External Memory Suffix Sorting. In: Cicalese, F., Porat, E., Vaccaro, U. (eds) Combinatorial Pattern Matching. CPM 2015. Lecture Notes in Computer Science(), vol 9133. Springer, Cham. https://doi.org/10.1007/978-3-319-19929-0_28

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-19929-0_28

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-19928-3

  • Online ISBN: 978-3-319-19929-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics