Reducing Overfitting in Predicting Intrinsically Unstructured Proteins

Han, Pengfei; Zhang, Xiuzhen; Norton, Raymond S.; Feng, Zhiping

doi:10.1007/978-3-540-71701-0_53

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4426)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Intrinsically unstructured or disordered proteins are proteins that lack a fixed 3-D structure globally or contain long disordered regions. Predicting disordered regions has attracted significant research attention recently. In developing a decision-tree-based disordered region predictor, we note that many previous predictors that use the 20 amino acid compositions as training parameters tend to overfit the data. In this paper we propose to alleviate overfitting in the prediction of intrinsically unstructured proteins by reducing the number of input parameters. We also compare this approach with the random forest model, which is inherently tolerant to overfitting. Our experiments suggest that reducing the 20 amino acid compositions to 4 groups according to amino acid properties can reduce overfitting in the decision tree model. Alternatively, ensemble-learning techniques such as random forest are inherently more tolerant to this kind of overfitting and can be promising candidates for disordered region prediction.
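
The input reduction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which four groups were used, so the `GROUPS` partition by side-chain property below is an assumption, and the function names `composition` and `reduced_composition` are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: an illustrative 4-way partition by side-chain property.
# The grouping actually used in the paper is not given in the abstract.
GROUPS = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQGPY",
    "positive": "KRH",
    "negative": "DE",
}


def composition(window):
    """Return the 20-feature amino acid composition of a sequence window."""
    counts = Counter(window)
    # Only standard residues contribute to the denominator.
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return {a: 0.0 for a in AMINO_ACIDS}
    return {a: counts[a] / total for a in AMINO_ACIDS}


def reduced_composition(window):
    """Collapse the 20 composition features into 4 group features."""
    comp = composition(window)
    return {
        name: sum(comp[a] for a in members)
        for name, members in GROUPS.items()
    }


print(reduced_composition("AAKKDDSS"))
# {'hydrophobic': 0.25, 'polar': 0.25, 'positive': 0.25, 'negative': 0.25}
```

Each window is then described by 4 features instead of 20, which gives a decision tree fewer attributes to split on and so fewer opportunities to fit noise in the training set.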

Author information

Authors and Affiliations

School of Computer Science and IT, RMIT University, Melbourne, VIC 3001, AUS
Pengfei Han & Xiuzhen Zhang
The Walter and Eliza Hall Institute of Medical Research, Parkville, VIC 3050, AUS
Raymond S. Norton & Zhiping Feng

Editor information

Zhi-Hua Zhou, Hang Li, Qiang Yang

Cite this paper

Han, P., Zhang, X., Norton, R.S., Feng, Z. (2007). Reducing Overfitting in Predicting Intrinsically Unstructured Proteins. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_53

DOI: https://doi.org/10.1007/978-3-540-71701-0_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71700-3
Online ISBN: 978-3-540-71701-0
eBook Packages: Computer Science, Computer Science (R0)