Abstract
Matching a mass spectrum against a text (a key computational task in proteomics) is slow because existing text indexing algorithms (with search time independent of the text size) are not applicable in the domain of mass spectrometry. As a result, many important applications (e.g., searches for mutated peptides) are prohibitively time-consuming, and even the standard search for non-mutated peptides is becoming too slow with recent advances in high-throughput genomics and proteomics technologies. We introduce a new paradigm, the Blocked Pattern Matching (BPM) Problem, that models peptide identification. BPM corresponds to matching a pattern against a text (over the alphabet of integers) under the assumption that each symbol a in the pattern can match a block of consecutive symbols in the text with total sum a.
BPM opens a new, still unexplored, direction in combinatorial pattern matching and leads to the Mutated BPM (modeling identification of mutated peptides) and Fused BPM (modeling identification of fused peptides in tumor genomes). We illustrate how BPM algorithms solve problems that are beyond the reach of existing proteomics tools.
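The matching rule in the BPM definition can be illustrated with a small sketch. It assumes that both text and pattern are sequences of positive integers (for example, integer-rounded amino acid masses), and it uses a naive scan over all start positions with prefix sums. It is not the indexing approach of the paper, and the function name bpm_matches is hypothetical, chosen only for this illustration.

```python
from itertools import accumulate


def bpm_matches(text, pattern):
    """Return every start index i such that text[i:] begins with
    consecutive blocks whose sums equal the pattern symbols in order.

    Assumption (not from the paper): all symbols are positive integers,
    so prefix sums are strictly increasing and each block is non-empty.
    """
    if any(x <= 0 for x in text) or any(a <= 0 for a in pattern):
        raise ValueError("this sketch assumes positive integer symbols")

    # prefix[i] is the sum of text[:i]; a block text[i:j] sums to
    # prefix[j] - prefix[i].
    prefix = [0] + list(accumulate(text))
    boundaries = set(prefix)

    matches = []
    for start in range(len(text)):
        target = prefix[start]
        for a in pattern:
            # The block for symbol a must end exactly at a text boundary.
            target += a
            if target not in boundaries:
                break
        else:
            matches.append(start)
    return matches


if __name__ == "__main__":
    text = [2, 3, 5, 1, 4, 2, 3]
    print(bpm_matches(text, [5, 6]))  # blocks (2,3) and (5,1) at start 0
    print(bpm_matches(text, [5]))
```

This scan takes time proportional to the text length times the pattern length, which is the kind of text-size-dependent cost the abstract describes as too slow; it is meant only to make the block-sum matching condition concrete.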
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ng, J., Amir, A., Pevzner, P.A. (2011). Blocked Pattern Matching Problem and Its Applications in Proteomics. In: Bafna, V., Sahinalp, S.C. (eds) Research in Computational Molecular Biology. RECOMB 2011. Lecture Notes in Computer Science, vol 6577. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20036-6_27
DOI: https://doi.org/10.1007/978-3-642-20036-6_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20035-9
Online ISBN: 978-3-642-20036-6
eBook Packages: Computer Science (R0)