A pipeline for improved QSAR analysis of peptides: physiochemical property parameter selection via BMSF, near-neighbor sample selection via semivariogram, and weighted SVR regression and prediction

Dai, Zhijun; Wang, Lifeng; Chen, Yuan; Wang, Haiyan; Bai, Lianyang; Yuan, Zheming

doi:10.1007/s00726-014-1667-5

Original Article
Published: 28 January 2014

Amino Acids, Volume 46, pages 1105–1119 (2014)

Zhijun Dai^1,2,
Lifeng Wang^1,2,
Yuan Chen^1,2,
Haiyan Wang^3,
Lianyang Bai^2,4 &
…
Zheming Yuan^1,2

Abstract

In this paper, we present a pipeline for improved QSAR analysis of peptides. The modeling involves a double selection procedure that performs feature selection first and sample selection second, before the final regression analysis. Five hundred and thirty-one physicochemical property parameters of amino acids were used as descriptors to characterize the structure of peptides. These high-dimensional descriptors then pass through a feature selection step, the binary matrix shuffling filter (BMSF), which yields a low-dimensional set of important features. Each descriptor that passes the BMSF filter also receives a weight defined by its contribution to reducing the estimation error. The selected features serve as the predictors for subsequent sample selection and modeling. Based on the weighted Euclidean distances between samples, a common range was determined with a high-dimensional semivariogram and then used as a threshold to select near-neighbor samples from the training set. For each sample to be predicted, a QSAR model was established using support vector regression (SVR) with the weighted, selected features, trained only on that sample's own set of near-neighbor training samples. Prediction was conducted for each test sample accordingly. The performance of this pipeline was tested on QSAR analyses of angiotensin-converting enzyme inhibitor and HLA-A*0201 data sets. Improved prediction accuracy was obtained in both applications. The pipeline optimizes QSAR modeling from both the feature selection and sample selection perspectives, which leads to improved accuracy over single-selection methods. We expect this pipeline to have broad application prospects in the field of regression prediction.
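The sample-selection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the feature weights are already available (standing in for the BMSF output, which is not reproduced here), that each weight multiplies the squared difference inside the Euclidean distance, that the semivariogram is the classical binned estimator, and that the range threshold is supplied by hand rather than derived from the curve. All function names, data values, and the threshold are hypothetical.

```python
import math
from itertools import combinations


def weighted_distance(x, y, weights):
    """Weighted Euclidean distance; each weight scales one squared difference (assumed form)."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


def empirical_semivariogram(X, y, weights, bin_width):
    """Classical binned estimator: gamma(h) = sum((y_i - y_j)^2) / (2 * N(h)).

    Returns a dict mapping each distance-bin centre to its gamma value.
    """
    sums, counts = {}, {}
    for i, j in combinations(range(len(X)), 2):
        b = int(weighted_distance(X[i], X[j], weights) // bin_width)
        sums[b] = sums.get(b, 0.0) + (y[i] - y[j]) ** 2
        counts[b] = counts.get(b, 0) + 1
    return {(b + 0.5) * bin_width: sums[b] / (2 * counts[b]) for b in sorted(sums)}


def select_near_neighbors(query, X_train, weights, range_threshold):
    """Indices of training samples whose weighted distance to the query is within the range."""
    return [
        i
        for i, x in enumerate(X_train)
        if weighted_distance(query, x, weights) <= range_threshold
    ]


# Toy data (hypothetical): four training samples with two selected descriptors each.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
y_train = [1.0, 2.0, 3.0, 7.0]
weights = [1.0, 0.25]   # stand-in for BMSF-derived feature weights
range_threshold = 1.5   # stand-in for the range read off the semivariogram

gamma = empirical_semivariogram(X_train, y_train, weights, bin_width=2.0)
neighbors = select_near_neighbors([0.0, 0.0], X_train, weights, range_threshold)

print(gamma)
print(neighbors)
```

In the pipeline as described, a separate SVR model would then be fitted for each test sample on the training samples indexed by `neighbors`. That step is omitted here because it needs an external SVR library.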

Acknowledgments

This work was supported by the Doctoral Foundation of Ministry of Education of China (No. 20124320110002), the Scientific Research Fund of the Hunan Provincial Financial Department (No. 62020411074) and the Postgraduate Scientific Research Innovation Project of Hunan Province, China (No. CX2013B306). The work of H. Wang was partially supported by a grant from the Simons Foundation (#246077).

Conflict of interest

The authors declare no conflict of interest.

Author information

Authors and Affiliations

Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, China
Zhijun Dai, Lifeng Wang, Yuan Chen & Zheming Yuan
Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, China
Zhijun Dai, Lifeng Wang, Yuan Chen, Lianyang Bai & Zheming Yuan
Department of Statistics, Kansas State University, Manhattan, KS, USA
Haiyan Wang
Hunan Academy of Agricultural Sciences, Changsha, China
Lianyang Bai

Corresponding author

Correspondence to Zheming Yuan.

Additional information

Z. Dai and L. Wang contributed equally to this work.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (XLSX 14 kb)

Supplementary material 2 (XLSX 14 kb)

About this article

Cite this article

Dai, Z., Wang, L., Chen, Y. et al. A pipeline for improved QSAR analysis of peptides: physiochemical property parameter selection via BMSF, near-neighbor sample selection via semivariogram, and weighted SVR regression and prediction. Amino Acids 46, 1105–1119 (2014). https://doi.org/10.1007/s00726-014-1667-5

Received: 15 October 2013
Accepted: 03 January 2014
Published: 28 January 2014
Issue Date: April 2014
DOI: https://doi.org/10.1007/s00726-014-1667-5
