Reducing Overfitting in Predicting Intrinsically Unstructured Proteins

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2007)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 4426))

Abstract

Intrinsically unstructured or disordered proteins are proteins that lack a fixed 3-D structure globally or contain long disordered regions. Predicting disordered regions has attracted significant research attention recently. In developing a decision-tree-based disordered region predictor, we note that many previous predictors that use the 20 amino acid compositions as training parameters tend to overfit the data. In this paper we propose to alleviate overfitting in the prediction of intrinsically unstructured proteins by reducing the number of input parameters. We also compare this approach with the random forest model, which is inherently tolerant of overfitting. Our experiments suggest that reducing the 20 amino acid compositions to 4 groups according to amino acid properties can reduce overfitting in the decision tree model. Alternatively, ensemble-learning techniques such as random forest are inherently more tolerant of this kind of overfitting and can be promising candidates for disordered region prediction.
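
The parameter reduction described in the abstract can be illustrated with a minimal Python sketch. It computes the 20 amino acid compositions for a sequence window and then sums them into 4 group compositions. The abstract does not state which residues belong to which group, so the `GROUPS` mapping below is a hypothetical grouping by physicochemical property, and the function names are illustrative rather than taken from the paper.

```python
from collections import Counter

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# ASSUMPTION: a hypothetical four-way grouping by physicochemical property.
# The abstract does not list the groups actually used in the paper.
GROUPS = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGP",
    "positive": "KRH",
    "negative": "DE",
}


def composition(window):
    """Return the fraction of each standard amino acid in a sequence window."""
    counts = Counter(window.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("window contains no standard amino acids")
    return {a: counts[a] / total for a in AMINO_ACIDS}


def reduced_composition(window):
    """Collapse the 20 compositions into one summed fraction per group."""
    comp = composition(window)
    return {
        name: sum(comp[a] for a in members)
        for name, members in GROUPS.items()
    }


if __name__ == "__main__":
    window = "MKKDEESSPA"  # made-up example window, not data from the paper
    print(len(composition(window)), "input parameters before reduction")
    print(len(reduced_composition(window)), "input parameters after reduction")
    print(reduced_composition(window))
```

Each window is then described by 4 features instead of 20, which gives a decision tree fewer parameters to split on. Whether this reduces overfitting for a given dataset would still need to be checked by comparing training and held-out accuracy.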

Editor information

Zhi-Hua Zhou, Hang Li, Qiang Yang

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Han, P., Zhang, X., Norton, R.S., Feng, Z. (2007). Reducing Overfitting in Predicting Intrinsically Unstructured Proteins. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_53

  • DOI: https://doi.org/10.1007/978-3-540-71701-0_53

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-71700-3

  • Online ISBN: 978-3-540-71701-0

  • eBook Packages: Computer Science, Computer Science (R0)
