Abstract
A number of bioinformatics tools use regular expression (RE) matching to locate protein or DNA sequence motifs that have been discovered by researchers in the laboratory. For example, patterns representing nuclear localisation signals (NLSs) are used to predict nuclear localisation. NLSs are not yet well understood, and so the set of currently known NLSs may be incomplete. Here we use genetic programming (GP) to generate RE-based classifiers for nuclear localisation. While the approach is a supervised one (with respect to protein location), it is unsupervised with respect to already-known NLSs. It therefore has the potential to discover new NLS motifs. We apply both tree-based and linear GP to the problem. The inclusion of predicted secondary structure in the input does not improve performance. Benchmarking shows that our majority classifiers are competitive with existing tools. The evolved REs are usually “NLS-like” and work is underway to analyse these for novelty.
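To make the idea of an RE-based classifier concrete, the sketch below scores a protein sequence against a small set of regular expressions and predicts nuclear localisation when a strict majority of them match. It is only an illustration under stated assumptions: the three patterns are hand-written, NLS-like examples and not the evolved REs reported in the paper, the names `PATTERNS`, `votes` and `predict_nuclear` are hypothetical, and reading "majority classifier" as a vote over several REs is our interpretation of the abstract.

```python
import re

# Hypothetical NLS-like patterns, written by hand for illustration only.
# They are NOT the evolved REs from the paper.
PATTERNS = [
    re.compile(r"K[KR].[KR]"),              # short basic cluster (monopartite-like)
    re.compile(r"[KR]{2}.{10,12}[KR]{3}"),  # two basic clusters with a spacer (bipartite-like)
    re.compile(r"[KR]{4}"),                 # run of four basic residues
]


def votes(sequence):
    """Count how many patterns match anywhere in the amino-acid sequence."""
    return sum(1 for pattern in PATTERNS if pattern.search(sequence))


def predict_nuclear(sequence):
    """Predict nuclear localisation if a strict majority of patterns match.

    Majority voting is an assumption made for this sketch.
    """
    return votes(sequence) * 2 > len(PATTERNS)


if __name__ == "__main__":
    # Short test peptides: the SV40 large T antigen NLS and a peptide
    # with no basic residues.
    for peptide in ("PKKKRKV", "MAGSTLLDEAVG"):
        print(peptide, votes(peptide), predict_nuclear(peptide))
```

In a GP setting, the hand-written patterns would be replaced by candidate REs produced by the evolutionary search, with each candidate scored by how well its matches separate nuclear from non-nuclear proteins in a labelled training set.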
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Heddad, A., Brameier, M., MacCallum, R.M. (2004). Evolving Regular Expression-Based Sequence Classifiers for Protein Nuclear Localisation. In: Raidl, G.R., et al. Applications of Evolutionary Computing. EvoWorkshops 2004. Lecture Notes in Computer Science, vol 3005. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24653-4_4
DOI: https://doi.org/10.1007/978-3-540-24653-4_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21378-9
Online ISBN: 978-3-540-24653-4
eBook Packages: Springer Book Archive