Abstract
The suffix array is an efficient in-memory data structure for pattern search; and two-level variants also exist that are suited to external searching and can handle strings larger than the available memory. Assuming the latter situation, we introduce a factor-based mechanism for compressing the text string that integrates seamlessly with the in-memory index search structure, rather than requiring a separate dictionary. We also introduce a mixture of indexed and sequential pattern search in a trade-off that allows further space savings. Experiments on a 4 GB computer with 62.5 GB of English text show that a two-level arrangement is possible in which around 2.5% of the text size is required as an index and for which the disk-resident components, including the text itself, occupy less than twice the space of the original text; and with which count queries can be carried out using two disk accesses and less than two milliseconds of CPU time.
The original version of this chapter was revised: The copyright line was incorrect. This has been corrected. The Erratum to this chapter is available at DOI: 10.1007/978-3-319-02432-5_33
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Baeza-Yates, R.A., Barbosa, E.F., Ziviani, N.: Hierarchies of indices for text searching. Inf. Systems 21(6), 497–514 (1996)
Colussi, L., De Col, A.: A time and space efficient data structure for string searching on large texts. Inf. Processing Letters 58(5), 217–222 (1996)
Ferguson, M.P.: FEMTO: Fast search of large sequence collections. In: Kärkkäinen, J., Stoye, J. (eds.) CPM 2012. LNCS, vol. 7354, pp. 208–219. Springer, Heidelberg (2012)
Ferragina, P., Grossi, R.: The string B-tree: A new data structure for search in external memory and its applications. J. ACM 46(2), 236–280 (1999)
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
Gog, S., Moffat, A., Culpepper, J.S., Turpin, A., Wirth, A.: Large-scale pattern search using reduced-space on-disk suffix arrays. IEEE Trans. Knowledge and Data Engineering (to appear)
Gog, S., Petri, M.: Optimized succinct data structures for massive data. Software Practice & Experience (to appear, 2013), http://dx.doi.org/10.1002/spe.2198
González, R., Navarro, G.: A compressed text index on secondary memory. J. Combinatorial Mathematics and Combinatorial Comp. 71, 127–154 (2009)
Kärkkäinen, J., Rao, S.S.: Full-text indexes in external memory. In: Meyer, U., Sanders, P., Sibeyn, J.F. (eds.) Algorithms for Memory Hierarchies. LNCS, vol. 2625, pp. 149–170. Springer, Heidelberg (2003)
Mäkinen, V., Navarro, G.: Compressed compact suffix arrays. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 420–433. Springer, Heidelberg (2004)
Manber, U., Myers, G.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comp. 22(5), 935–948 (1993)
Moffat, A., Puglisi, S.J., Sinha, R.: Reducing space requirements for disk resident suffix arrays. In: Zhou, X., Yokota, H., Deng, K., Liu, Q. (eds.) DASFAA 2009. LNCS, vol. 5463, pp. 730–744. Springer, Heidelberg (2009)
Sinha, R., Puglisi, S.J., Moffat, A., Turpin, A.: Improving suffix array locality for fast pattern matching on disk. In: Wang, J.T.-L. (ed.) Proc. ACM SIGMOD Int. Conf. Management of Data, pp. 661–672 (2008)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gog, S., Moffat, A. (2013). Adding Compression and Blended Search to a Compact Two-Level Suffix Array. In: Kurland, O., Lewenstein, M., Porat, E. (eds) String Processing and Information Retrieval. SPIRE 2013. Lecture Notes in Computer Science, vol 8214. Springer, Cham. https://doi.org/10.1007/978-3-319-02432-5_18
Download citation
DOI: https://doi.org/10.1007/978-3-319-02432-5_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02431-8
Online ISBN: 978-3-319-02432-5
eBook Packages: Computer ScienceComputer Science (R0)