
Common Substrings in Random Strings

  • Conference paper
Combinatorial Pattern Matching (CPM 2006)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4009)


Abstract

In computational biology, an important problem is to identify a word of length k present in each of a given set of sequences. Here, we investigate the problem of calculating the probability that such a word exists in a set of r random strings. Existing methods to approximate this probability are either inaccurate when r > 2 or are restricted to Bernoulli models. We introduce two new methods for computing this probability under Bernoulli and Markov models. We present generalizations of the methods to compute the probability of finding a word of length k shared among q of r sequences, and to allow mismatches. We show through simulations that our approximations are significantly more accurate than methods previously published.
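The quantity studied in the abstract can be estimated by brute-force simulation, which is useful as a baseline when checking any approximation. The sketch below is not the authors' method: it is a plain Monte Carlo estimate under a Bernoulli (independent letters) model, and every name in it (`random_string`, `has_common_word`, `estimate_probability`), the common string length `n`, the default DNA alphabet, and the trial count are illustrative assumptions. The optional `q` argument covers the "shared among q of r sequences" variant; mismatches are not handled.

```python
import random


def random_string(n, alphabet, weights, rng):
    """Draw a length-n string with independent letters (Bernoulli model)."""
    return "".join(rng.choices(alphabet, weights=weights, k=n))


def kmers(s, k):
    """Set of distinct length-k words in s (empty if k > len(s))."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def has_common_word(strings, k, q=None):
    """True if some word of length k occurs in at least q of the strings.

    q defaults to len(strings), i.e. the word must occur in every string.
    """
    if q is None:
        q = len(strings)
    counts = {}
    for s in strings:
        # Use a set per string so repeated occurrences count once.
        for w in kmers(s, k):
            counts[w] = counts.get(w, 0) + 1
    return any(c >= q for c in counts.values())


def estimate_probability(r, n, k, alphabet="ACGT", weights=None,
                         trials=2000, seed=0, q=None):
    """Monte Carlo estimate of P(a common k-word exists) for r random
    strings of length n. All parameters are assumptions of this sketch."""
    rng = random.Random(seed)
    weights = weights or [1] * len(alphabet)
    hits = 0
    for _ in range(trials):
        strings = [random_string(n, alphabet, weights, rng) for _ in range(r)]
        hits += has_common_word(strings, k, q)
    return hits / trials


if __name__ == "__main__":
    # Example: 5 uniform DNA strings of length 100, common word of length 4.
    print(estimate_probability(r=5, n=100, k=4))
```

The estimate carries the usual sampling error of order 1/sqrt(trials), so very small probabilities need many trials or a different technique.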




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Blais, E., Blanchette, M. (2006). Common Substrings in Random Strings. In: Lewenstein, M., Valiente, G. (eds) Combinatorial Pattern Matching. CPM 2006. Lecture Notes in Computer Science, vol 4009. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11780441_13


  • DOI: https://doi.org/10.1007/11780441_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-35455-0

  • Online ISBN: 978-3-540-35461-1

  • eBook Packages: Computer Science, Computer Science (R0)
