Abstract
The significant expansion in protein sequence and structure data that we are now witnessing creates a pressing need to bring order to the protein world. Such order enables us to gain insights into the evolution of proteins, their function, and the extent to which the functional repertoire can vary across the three kingdoms of life. This has led to the creation of a wide range of protein family classifications that aim to group proteins based upon their evolutionary relationships.
In this chapter we discuss the approaches and methods that are frequently used in the classification of proteins, with a specific emphasis on the classification of protein domains. The construction of both domain sequence and domain structure databases is considered and we show how the use of domain family annotations to assign structural and functional information is enhancing our understanding of genomes.
© 2017 Springer Science+Business Media New York
About this protocol
Cite this protocol
Dawson, N., Sillitoe, I., Marsden, R.L., Orengo, C.A. (2017). The Classification of Protein Domains. In: Keith, J. (eds) Bioinformatics. Methods in Molecular Biology, vol 1525. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-6622-6_7
DOI: https://doi.org/10.1007/978-1-4939-6622-6_7
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-6620-2
Online ISBN: 978-1-4939-6622-6
eBook Packages: Springer Protocols