Gene Prediction Methods

Majoros, William H.; Korf, Ian; Ohler, Uwe

doi:10.1007/978-0-387-92738-1_5

Abstract

Most computational gene-finding methods in current use are derived from the fields of natural language processing and speech recognition. These latter fields are concerned with parsing spoken or written language into functional components such as nouns, verbs, and phrases of various types. The parsing task is governed by a set of syntax rules that dictate which linguistic elements may immediately follow each other in well-formed sentences – for example,

$$\text{subject} \rightarrow \text{verb},\quad \text{verb} \rightarrow \text{direct object},\quad \text{etc.}$$

The problem of gene-finding is rather similar to linguistic parsing in that we wish to partition a sequence of letters into elements of biological relevance, such as exons, introns, and the intergenic regions separating genes. That is, we wish to not only find the genes, but also to predict their internal exon-intron structure so that the encoded protein(s) may be deduced. Figure 5.1 illustrates this internal structure for a typical gene.
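The parsing analogy can be made concrete with a small sketch. The code below is illustrative only and is not the chapter's model: the three element types, the single-letter labels (`N`, `E`, `I`), the `ALLOWED_NEXT` table, and both function names are assumptions chosen for the example. It collapses a per-base labeling into elements and checks that adjacent elements obey a set of "syntax rules", in the same spirit as the subject-verb rules above.

```python
from itertools import groupby

# Hypothetical syntax rules: which element may immediately follow which.
# These are illustrative assumptions, not rules taken from the chapter.
ALLOWED_NEXT = {
    "intergenic": {"exon"},
    "exon": {"intron", "intergenic"},
    "intron": {"exon"},
}

# Hypothetical one-letter labels for a per-base annotation string.
LABEL_NAMES = {"N": "intergenic", "E": "exon", "I": "intron"}


def segments(labels):
    """Collapse a per-base label string into (element, start, end) tuples.

    Coordinates are 0-based and the end is exclusive.
    """
    out, pos = [], 0
    for symbol, run in groupby(labels):
        length = len(list(run))
        out.append((LABEL_NAMES[symbol], pos, pos + length))
        pos += length
    return out


def is_well_formed(parse):
    """Return True if the parse begins and ends in intergenic sequence
    and every adjacent pair of elements obeys ALLOWED_NEXT."""
    elements = [name for name, _, _ in parse]
    if not elements or elements[0] != "intergenic" or elements[-1] != "intergenic":
        return False
    return all(b in ALLOWED_NEXT[a] for a, b in zip(elements, elements[1:]))


if __name__ == "__main__":
    parse = segments("NNEEEIIEENN")
    for element in parse:
        print(element)
    print(is_well_formed(parse))
```

A check like this only tests whether a candidate parse is syntactically legal. It does not score or rank competing parses, so choosing among the many legal partitions of a real sequence requires a model on top of the rules.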

Author information

Authors and Affiliations

Institute for Genome Sciences & Policy, Duke University, Durham, NC, USA
William H. Majoros

Corresponding author

Correspondence to William H. Majoros.

Editor information

Editors and Affiliations

Inst. Molecular Bioscience, University of Queensland, St. Lucia, 4072, Australia
David Edwards
Dept. Plant & Microbial Biology, University of California, Berkeley, Koshland Hall 111, Berkeley, 94720, U.S.A.
Jason Stajich
e-Health Research Centre, Adelaide St. 300, Brisbane, 4000, Australia
David Hansen

About this chapter

Cite this chapter

Majoros, W.H., Korf, I., Ohler, U. (2009). Gene Prediction Methods. In: Edwards, D., Stajich, J., Hansen, D. (eds) Bioinformatics. Springer, New York, NY. https://doi.org/10.1007/978-0-387-92738-1_5

DOI: https://doi.org/10.1007/978-0-387-92738-1_5
Published: 05 August 2009
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-92737-4
Online ISBN: 978-0-387-92738-1
eBook Packages: Biomedical and Life Sciences (R0)
