Efficient Enumeration of Phylogenetically Informative Substrings

Angelov, Stanislav; Harb, Boulos; Kannan, Sampath; Khanna, Sanjeev; Kim, Junhyong

doi:10.1007/11732990_22

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3909)

Included in the following conference series:

Annual International Conference on Research in Computational Molecular Biology

Abstract

We study the problem of enumerating substrings that are common amongst genomes that share evolutionary descent. For example, one might want to enumerate all identical (therefore conserved) substrings that are shared between all mammals and not found in non-mammals. Such a collection of substrings may be used to identify conserved subsequences or to construct sets of identifying substrings for branches of a phylogenetic tree. For two disjoint sets of genomes on a phylogenetic tree, a substring is called a discriminating substring or a tag if it is found in all of the genomes of one set and none of the genomes of the other set. Given a phylogeny for a set of m species, each with a genome of length at most n, we develop a suffix-tree-based algorithm to find all tags in O(nm log² m) time. We also develop a sublinear-space algorithm (at the expense of running time) that is better suited to very large data sets. We next consider a stochastic model of evolution to understand how tags arise. We show that in this setting, a simple process of tag generation essentially captures all possible ways of generating tags. We use this insight to develop a faster tag discovery algorithm with a small chance of error. However, tags are not guaranteed to exist in a given data set. We thus generalize the notion of a tag from a single substring to a set of substrings, whereby each species in one set contains a large fraction of the substrings while each species in the other set contains only a small fraction of the substrings. We study the complexity of this problem and give a simple linear-programming-based approach for finding approximate generalized tag sets. Finally, we use our tag enumeration algorithm to analyze a phylogeny containing 57 whole microbial genomes. We find tags for all nodes in the phylogeny except the root, for which we find generalized tag sets.
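
To make the definition of a tag concrete, the following is a minimal brute-force sketch in Python. It is not the suffix-tree algorithm described in the abstract and does not achieve its running time; it simply intersects and subtracts sets of fixed-length substrings. The function names (`substrings`, `find_tags`), the fixed length `k`, and the toy sequences are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, Set


def substrings(genome: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of a genome."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def find_tags(inside: Iterable[str], outside: Iterable[str], k: int) -> Set[str]:
    """Brute-force illustration of the tag definition (an assumption, not the
    paper's method): keep the length-k substrings present in every genome of
    `inside` and absent from every genome of `outside`."""
    inside = list(inside)
    if not inside:
        return set()
    # Substrings shared by all genomes in the first set.
    common = set.intersection(*(substrings(g, k) for g in inside))
    # Discard anything that also occurs in a genome of the second set.
    for g in outside:
        common -= substrings(g, k)
    return common


# Toy sequences, made up for this example.
group_a = ["ACGTAC", "TTACGT", "GACGTT"]
group_b = ["ACGGTA", "CGTTAA"]

tags_k4 = find_tags(group_a, group_b, 4)
tags_k3 = find_tags(group_a, group_b, 3)
print(tags_k4)  # {'ACGT'}
print(tags_k3)  # set()
```

In this toy input a tag exists at length 4 but not at length 3, which mirrors the abstract's remark that tags are not guaranteed to exist and motivates the generalized tag sets.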

Author information

Authors and Affiliations

Department of Computer and Information Sciences, University of Pennsylvania,
Stanislav Angelov, Boulos Harb, Sampath Kannan & Sanjeev Khanna
Department of Biology, University of Pennsylvania,
Junhyong Kim

Editor information

Editors and Affiliations

Georgia Institute of Technology and Università di Padova,
Alberto Apostolico
Topic Chairs, P.O. Box
Concettina Guerra
Center for Molecular Biology and Computer Science Department, Brown University, 115 Waterman St., 02912, Providence, RI, USA
Sorin Istrail
University of California, San Diego, USA
Pavel A. Pevzner
Department of Molecular and Computational Biology, University of Southern California, 1050 Childs Way, 90089-2910, Los Angeles, CA, USA
Michael Waterman

About this paper

Cite this paper

Angelov, S., Harb, B., Kannan, S., Khanna, S., Kim, J. (2006). Efficient Enumeration of Phylogenetically Informative Substrings. In: Apostolico, A., Guerra, C., Istrail, S., Pevzner, P.A., Waterman, M. (eds) Research in Computational Molecular Biology. RECOMB 2006. Lecture Notes in Computer Science, vol 3909. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11732990_22

DOI: https://doi.org/10.1007/11732990_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33295-4
Online ISBN: 978-3-540-33296-1
eBook Packages: Computer Science, Computer Science (R0)