Abstract
Quantitative structure-activity relationships (QSARs) are mathematical models that seek a quantitative relationship between a set of chemical compounds and a specific activity or endpoint, such as toxicity, a chemical or physical property, or a biological activity. To correlate the chemicals with the selected endpoints, QSAR models use molecular descriptors (MDs), which encode specific chemical information or features of the molecules. Early QSAR models were based on a small set of MDs and a single endpoint, and the correlation was usually linear. Today, QSAR models are usually non-linear and are built from thousands of chemicals and hundreds of MDs. In addition, newer QSAR models aim to predict different endpoints with the same model, the so-called multi-target QSAR (MT-QSAR). For this reason, many QSARs are now developed using machine learning approaches that can model a dataset with several endpoints. Although these approaches have proven able to handle MT-QSAR problems, feature selection (FS) in these cases remains a challenging task and a central issue in the QSAR field. Accordingly, the main aim of this chapter is to analyze feature selection methods used while developing non-linear QSAR models.
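To make the idea of feature selection over molecular descriptors concrete, the following is a minimal sketch of a simple filter approach, not the method analyzed in the chapter. It assumes a small in-memory table of descriptors; the function names (`pearson`, `filter_descriptors`), the thresholds, and the toy descriptor and toxicity values are all hypothetical and chosen only for illustration. The filter uses linear correlation, so it would serve at most as a pre-screening step before a non-linear model.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant column: correlation is undefined, treat as 0
    return sxy / sqrt(sxx * syy)


def filter_descriptors(columns, endpoint, min_variance=1e-8, max_intercorr=0.95):
    """Return the names of descriptors kept by a simple filter.

    columns  : dict mapping descriptor name -> list of values (one per compound)
    endpoint : list of endpoint values (one per compound)
    Thresholds are illustrative assumptions, not recommended defaults.
    """
    # Step 1: discard near-constant descriptors, which carry no information.
    candidates = []
    for name, values in columns.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        if variance > min_variance:
            candidates.append(name)

    # Step 2: rank the remaining descriptors by |correlation| with the endpoint.
    candidates.sort(key=lambda d: abs(pearson(columns[d], endpoint)), reverse=True)

    # Step 3: greedily keep a descriptor only if it is not redundant with
    # (too strongly correlated to) a descriptor that was already kept.
    kept = []
    for name in candidates:
        if all(abs(pearson(columns[name], columns[k])) < max_intercorr for k in kept):
            kept.append(name)
    return kept


# Toy data (hypothetical): five compounds, four descriptors, one endpoint.
descriptors = {
    "logP": [1.2, 2.5, 3.1, 0.8, 2.0],
    "logP_copy": [2.4, 5.0, 6.2, 1.6, 4.0],  # redundant: a rescaled logP
    "n_rings": [1, 1, 1, 1, 1],               # constant across compounds
    "mol_weight": [180.0, 150.0, 210.0, 165.0, 240.0],
}
toxicity = [0.9, 2.1, 2.8, 0.5, 1.7]

selected = filter_descriptors(descriptors, toxicity)
print(selected)
```

On this toy data the constant descriptor is discarded and only one of the two redundant logP columns survives. In practice, any selection step that looks at the endpoint should be repeated inside each cross-validation fold, so that the selection itself does not leak information into the validation estimate.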
Acknowledgments
This work received financial support from Fundação para a Ciência e a Tecnologia (FCT/MEC) through national funds and was co-financed by the European Union (FEDER funds) under the Partnership Agreement PT2020, through projects UID/QUI/50006/2013, POCI/01/0145/FEDER/007265, NORTE-01-0145-FEDER-000011 (LAQV@REQUIMTE), and the Interreg SUDOE NanoDesk (SOE1/P1/E0215; UP). RC also acknowledges FCT and the European Social Fund for financial support (Grant SFRH/BPD/80605/2011). To all financing sources, the authors are greatly indebted.
The authors declare no competing financial interest.
Copyright information
© 2020 Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Concu, R., Cordeiro, M.N.D.S. (2020). On the Relevance of Feature Selection Algorithms While Developing Non-linear QSARs. In: Roy, K. (eds) Ecotoxicological QSARs. Methods in Pharmacology and Toxicology. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-0150-1_8
DOI: https://doi.org/10.1007/978-1-0716-0150-1_8
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-0149-5
Online ISBN: 978-1-0716-0150-1
eBook Packages: Springer Protocols