Abstract
The unidirectional and bidirectional matching statistics between two strings s and t on alphabet Σ, and the shortest unique substrings of a single string t, are the cornerstone of a number of large-scale genome analysis applications, and they encode nontrivial structural properties of s and t. In this paper we compute for the first time the matching statistics between s and t in O((|s| + |t|)log|Σ|) time and in O(|s|log|Σ|) bits of space, circumventing the need for computing the depths of suffix tree nodes that characterized previous approaches. Symmetrically, we compute for the first time the shortest unique substrings of a string t in O(|t|log|Σ|) time and in O(|t|log|Σ|) bits of space. A key component of our methods is an encoding of both the unidirectional and the bidirectional statistics that takes 2|t| + o(|t|) bits of space and that allows constant-time access to every position.
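The abstract does not spell out the layout of the 2|t| + o(|t|)-bit encoding. The sketch below shows one layout that is consistent with that bound, under stated assumptions. It takes MS[i] to be the length of the longest prefix of t[i..] that occurs in s, and relies on the fact that MS[i] ≥ MS[i−1] − 1, so each gap MS[i] − MS[i−1] + 1 is nonnegative and can be written in unary. The function names are hypothetical, matching statistics are computed naively in quadratic time, and select is answered from a plain list of positions rather than from the succinct o(|t|)-bit structure that constant-time access would require.

```python
def matching_statistics(s: str, t: str) -> list[int]:
    """Naive MS: MS[i] is the length of the longest prefix of t[i:] occurring in s."""
    ms = []
    for i in range(len(t)):
        length = 0
        # Extend the match one character at a time while it still occurs in s.
        while i + length < len(t) and t[i:i + length + 1] in s:
            length += 1
        ms.append(length)
    return ms


def encode(ms: list[int]) -> list[int]:
    """Write each gap MS[i] - MS[i-1] + 1 as that many zeros, closed by a one."""
    bits = []
    prev = 1  # sentinel (an assumption of this sketch): the first run has MS[0] zeros
    for value in ms:
        bits.extend([0] * (value - prev + 1))
        bits.append(1)
        prev = value
    return bits


def access(bits: list[int], i: int) -> int:
    """Recover MS[i] as select1(i) - 2i, with 0-based positions and ranks."""
    # A real index would answer select in constant time; this list is a stand-in.
    ones = [pos for pos, bit in enumerate(bits) if bit == 1]
    return ones[i] - 2 * i


ms = matching_statistics("banana", "ananas")
bits = encode(ms)
decoded = [access(bits, i) for i in range(len(ms))]
```

The bitvector holds |t| ones and MS[|t|−1] + |t| − 1 ≤ |t| zeros, hence at most 2|t| bits. The i-th one is preceded by MS[i] + i zeros and i ones, which is why subtracting 2i from its position recovers MS[i].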
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Belazzougui, D., Cunial, F. (2014). Indexed Matching Statistics and Shortest Unique Substrings. In: Moura, E., Crochemore, M. (eds) String Processing and Information Retrieval. SPIRE 2014. Lecture Notes in Computer Science, vol 8799. Springer, Cham. https://doi.org/10.1007/978-3-319-11918-2_18
DOI: https://doi.org/10.1007/978-3-319-11918-2_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11917-5
Online ISBN: 978-3-319-11918-2
eBook Packages: Computer Science, Computer Science (R0)