The Classification of Protein Domains

Dawson, Natalie; Sillitoe, Ian; Marsden, Russell L.; Orengo, Christine A.

doi:10.1007/978-1-4939-6622-6_7

Part of the book series: Methods in Molecular Biology (MIMB, volume 1525)

Abstract

The significant expansion in protein sequence and structure data that we are now witnessing brings with it a pressing need to bring order to the protein world. Such order enables us to gain insights into the evolution of proteins, their function, and the extent to which the functional repertoire can vary across the three kingdoms of life. This has led to the creation of a wide range of protein family classifications that aim to group proteins based upon their evolutionary relationships.

In this chapter we discuss the approaches and methods that are frequently used in the classification of proteins, with a specific emphasis on the classification of protein domains. The construction of both domain sequence and domain structure databases is considered and we show how the use of domain family annotations to assign structural and functional information is enhancing our understanding of genomes.

Author information

Authors and Affiliations

Institute of Structural and Molecular Biology, University College London, Gower Street, London, WC1E 6BT, UK
Natalie Dawson, Ian Sillitoe, Russell L. Marsden & Christine A. Orengo

Corresponding author

Correspondence to Natalie Dawson.

Editor information

Editors and Affiliations

Monash University, Melbourne, Victoria, Australia
Jonathan M. Keith

About this protocol

Cite this protocol

Dawson, N., Sillitoe, I., Marsden, R.L., Orengo, C.A. (2017). The Classification of Protein Domains. In: Keith, J. (eds) Bioinformatics. Methods in Molecular Biology, vol 1525. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-6622-6_7

DOI: https://doi.org/10.1007/978-1-4939-6622-6_7
Published: 29 November 2016
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-6620-2
Online ISBN: 978-1-4939-6622-6
eBook Packages: Springer Protocols
