Abstract
We present in this paper two algorithms. The first one extracts repeated motifs from a sequence defined over an alphabet σ. For instance, σ may be equal to {A, C, G, T} and the sequence represents an encoding of a DNA macromolecule. The motifs searched correspond to words over the same alphabet which occur a minimum number q of times in the sequence with at most e mismatches each time (q is called the quorum constraint). The second algorithm extracts common motifs from a set of N ≥ 2 sequences. In this case, the motifs must occur, again with at most e mismatches, in 1 ≤ q ≤ N distinct sequences of the set. In both cases, the words representing the motifs may never be present exactly in the sequences. We therefore speak of the motifs, repeated in a sequence or common to a set of them, as being “external” objects and denote them by the expression “valid models” if they verify the quorum constraint q. The approach we introduce here for finding all valid models corresponding to either repeated or common motifs starts by building a suffix tree of the sequence(s) and then, after some further preprocessing, uses this tree to simply “spell” the models. Assuming an alphabet of fixed size, the total time needed is O(nN 2 V(e, k)) using O(nN 2/w) space, where n is the (average) length of the sequence(s), k is the length of the models sought or is the length of the longest possible valid models, w is the size of a word machine and V(e, k) is the number of words of length k; at a Hamming distance at most e from another k-length word. V(e, k) may be majored by k e¦σ¦e. This improves on an algorithm by Waterman [23]. It is also a better time bound than our previous approach [15] for the common motifs problem whenever N < k¦σ¦, and a better space bound when N/w < k. It is a better time and space bound in absolute for the repeated motifs problem. The complexities obtained in this second case are O(nV(e, k)) and O(n) respectively. Finally, we suggest how to extend these algorithms to deal with gaps.
Preview
Unable to display preview. Download preview PDF.
References
R. Baeza-Yates and G. H. Gonnet. A new approach to text searching. Commun. ACM, 35:74–82, 1992.
P. Bieganski, J. Riedl, J. V. Carlis, and E.M. Retzel. Generalized suffix trees for biological sequence data: applications and implementations. In Proc. of the 27th Hawai Int. Conf. on Systems Sci., pages 35–44. IEEE Computer Society Press, 1994.
B. Clift, D. Haussler, R. McConnell, T. D. Schneider, and G. D. Stormo. Sequence landscapes. Nucleic Acids Res., 14:141–158, 1986.
A.L. Cobbs. Fast identification of approximately matching substrings. In Z. Galil and E. Ukkonen, editors, Combinatorial Pattern Matching, volume 937 of Lecture Notes in Computer Science, pages 41–54. Springer Verlag, 1995.
M. Crochemore. An optimal algorithm for computing the repetitions in a word. Inf. Proc. Letters, 12:244–250, 1981.
M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, 1994.
D. J. Galas, M. Eggert, and M. S. Waterman. Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. J. Mol. Biol., 186:117–128, 1985.
D. Gusfield. Algorithms on Strings, Trees, and Sequences. Computer Science and Computational Biology. Cambridge University Press, 1997.
L. C. K. Hui. Color set size problem with applications to string matching. In A. Apostolico, M. Crochemore, Z. Galil, and U. Manber, editors, Combinatorial Pattern Matching, volume 644 of Lecture Notes in Computer Science, pages 230–243. Springer-Verlag, 1992.
C. E. Lawrence and A. A. Reilly. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: struct., funct., and genetics, 7:41–51, 1990.
C. Lefevre and J.-E. Ikeda. A fast word search algorithm for the representation of sequence similarity in genomic DNA. Nucleic Acids Res., 22:404–411, 1994.
E. M. McCreight. A space-economical suffix tree construction algorithm. J. ACM, 23:262–272, 1976.
E. W. Myers. A sublinear algorithm for approximate keyword searching. Algorithmica, 12:345–374, 1994.
E. W. Myers. 1997. personal communication.
M.-F. Sagot, V. Escalier, A. Viari, and H. Soldano. Searching for repeated words in a text allowing for mismatches and gaps. In R. Baeza-Yates and U. Manber, editors, Second South American Workshop on String Processing, pages 87–100, Viñas del Mar, Chili, 1995. University of Chili.
M.-F. Sagot and E. W. Myers. Identifying satellites in nucleic acid sequences. 1998. submitted to RECOMB 1998.
M.-F. Sagot and A. Viari. A double combinatorial approach to discovering patterns in biological sequences. In D. Hirschberg and G. Myers, editors, Combinatorial Pattern Matching, volume 1075 of Lecture Notes in Computer Science, pages 186–208. Springer-Verlag, 1996.
M.-F. Sagot, A. Viari, and H. Soldano. Multiple comparison: a peptide matching approach. Theoret. Comput. Sci., 180:115–137, 1997. presented at Combinatorial Pattern Matching 1995.
E. Ukkonen. Constructing suffix trees on-line in linear time, pages 484–492. IFIP'92, 1992.
E. Ukkonen. Approximate string matching over suffix trees. In Z. Galil A. Apostolico, M. Crochemore and U. Manber, editors, Combinatorial Pattern Matching, volume 684 of Lecture Notes in Computer Science, pages 228–242. Springer-Verlag, 1993.
M. S. Waterman. Multiple sequence alignments by consensus. Nucleic Acids Res., 14:9095–9102, 1986.
M. S. Waterman. Consensus patterns in sequences. In M. S. Waterman, editor, Mathematical Methods for DNA Sequences, pages 93–116. CRC Press, 1989.
M. S. Waterman, R. Arratia, and D. J. Galas. Pattern recognition in several sequences: consensus and alignment. Bull. Math. Biol., 46:515–527, 1984.
S. Wu and U. Manber. Agrep — a fast approximate pattern-matching tool, pages 153–162, San Francisco, CA, 1992. USENIX Technical Conference.
S. Wu and U. Manber. Fast text searching allowing errors. Commun. ACM, 35:83–91, 1992.
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sagot, M.F. (1998). Spelling approximate repeated or common motifs using a suffix tree. In: Lucchesi, C.L., Moura, A.V. (eds) LATIN'98: Theoretical Informatics. LATIN 1998. Lecture Notes in Computer Science, vol 1380. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0054337
Download citation
DOI: https://doi.org/10.1007/BFb0054337
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64275-6
Online ISBN: 978-3-540-69715-2
eBook Packages: Springer Book Archive