Gene Prediction Methods

Bioinformatics

Abstract

Most computational gene-finding methods in current use are derived from the fields of natural language processing and speech recognition. These latter fields are concerned with parsing spoken or written language into functional components such as nouns, verbs, and phrases of various types. The parsing task is governed by a set of syntax rules that dictate which linguistic elements may immediately follow each other in well-formed sentences, for example:

$$\text{subject} \rightarrow \text{verb}, \quad \text{verb} \rightarrow \text{direct object}, \quad \text{etc.}$$

The problem of gene-finding is rather similar to linguistic parsing in that we wish to partition a sequence of letters into elements of biological relevance, such as exons, introns, and the intergenic regions separating genes. That is, we wish not only to find the genes but also to predict their internal exon-intron structure, so that the encoded protein(s) may be deduced. Figure 5.1 illustrates this internal structure for a typical gene.
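
The parsing analogy can be made concrete with a small sketch. The Python below is illustrative only and is not taken from the chapter: the rule table `ALLOWED_NEXT`, the function names, and the coordinates are all assumptions, and the rules are deliberately simplified (a single strand, no reading-frame or splice-signal constraints). It treats a gene structure as a list of labeled segments and checks that each element is followed only by an element the "syntax" permits.

```python
# Hypothetical syntax rules: which element may immediately follow which.
# Simplified for illustration; real gene finders use far richer models.
ALLOWED_NEXT = {
    "intergenic": {"exon"},
    "exon": {"intron", "intergenic"},
    "intron": {"exon"},
}


def is_well_formed(parse):
    """Check a parse given as a list of (label, start, end) segments.

    Segments use half-open coordinates [start, end). A parse is accepted
    when it starts and ends in intergenic sequence, every segment is
    non-empty, consecutive segments are contiguous, and every adjacent
    pair of labels is permitted by ALLOWED_NEXT.
    """
    if not parse:
        return False
    if parse[0][0] != "intergenic" or parse[-1][0] != "intergenic":
        return False
    if any(end <= start for _, start, end in parse):
        return False
    for (label, _, end), (next_label, next_start, _) in zip(parse, parse[1:]):
        if next_start != end:
            return False  # gap or overlap between neighbouring segments
        if next_label not in ALLOWED_NEXT.get(label, set()):
            return False  # transition forbidden by the syntax rules
    return True


def exons(parse):
    """Return the (start, end) coordinates of every exon in the parse."""
    return [(start, end) for label, start, end in parse if label == "exon"]


# A made-up two-exon gene flanked by intergenic sequence.
example = [
    ("intergenic", 0, 100),
    ("exon", 100, 250),
    ("intron", 250, 600),
    ("exon", 600, 720),
    ("intergenic", 720, 900),
]
```

Calling `is_well_formed(example)` accepts this parse, whereas a parse in which an intron directly follows intergenic sequence is rejected. The exon coordinates returned by `exons(example)` are the pieces that would be joined to deduce the encoded protein.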



Author information

Corresponding author

Correspondence to William H. Majoros.

Copyright information

© 2009 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Majoros, W.H., Korf, I., Ohler, U. (2009). Gene Prediction Methods. In: Edwards, D., Stajich, J., Hansen, D. (eds) Bioinformatics. Springer, New York, NY. https://doi.org/10.1007/978-0-387-92738-1_5
