Feature Selection in High-Dimensional Dataset Using MapReduce

Reggiani, Claudio; Le Borgne, Yann-Aël; Bontempi, Gianluca

doi:10.1007/978-3-319-76892-2_8

Part of the book series: Communications in Computer and Information Science (CCIS, volume 823)

Included in the following conference series:

Benelux Conference on Artificial Intelligence

Abstract

This paper describes a distributed MapReduce implementation of the minimum Redundancy Maximum Relevance (mRMR) algorithm, a popular feature selection method in bioinformatics and network inference problems. The proposed approach handles both tall/narrow and wide/short datasets. We further provide an open-source implementation based on Hadoop/Spark, and illustrate its scalability on datasets involving millions of observations or features.
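
To make the idea concrete, here is a minimal single-machine sketch of greedy mRMR written in a map/reduce style. It is not the authors' Hadoop/Spark implementation. It assumes discrete features, a dataset split by rows into partitions (the tall/narrow case), and the common "relevance minus mean redundancy" form of the mRMR score. The names `map_counts`, `mutual_information`, and `mrmr` are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from functools import reduce
from math import log


def map_counts(rows, i, j):
    """Map step (illustrative): joint value counts of columns i and j
    within a single row partition."""
    return Counter((row[i], row[j]) for row in rows)


def mutual_information(partitions, i, j):
    """Reduce step (illustrative): merge the per-partition counts, then
    compute the empirical mutual information I(X_i; X_j) in nats."""
    joint = reduce(
        lambda a, b: a + b,
        (map_counts(p, i, j) for p in partitions),
        Counter(),
    )
    n = sum(joint.values())
    px, py = Counter(), Counter()
    for (x, y), c in joint.items():
        px[x] += c
        py[y] += c
    return sum(
        c / n * log(c * n / (px[x] * py[y])) for (x, y), c in joint.items()
    )


def mrmr(partitions, n_features, target, k):
    """Greedy mRMR (assumed difference form): at each step, select the
    feature with the highest relevance to the target minus its mean
    redundancy with the features already selected."""
    k = min(k, n_features)
    relevance = {
        f: mutual_information(partitions, f, target) for f in range(n_features)
    }
    selected = []
    while len(selected) < k:
        def score(f):
            if not selected:
                return relevance[f]
            redundancy = sum(
                mutual_information(partitions, f, s) for s in selected
            ) / len(selected)
            return relevance[f] - redundancy

        candidates = [f for f in range(n_features) if f not in selected]
        selected.append(max(candidates, key=score))
    return selected
```

Here each row is a tuple of discrete values and `target` is the column index of the class variable. Because the per-partition counts are simply summed, the result does not depend on how the rows are split, which is what makes this counting step a natural fit for a map and a reduce phase.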

Acknowledgement

The author CR acknowledges the funding of the BridgeIris project (RBC/13-PFS EH-11) supported by INNOVIRIS (Brussels Institute for the encouragement of scientific research and innovation) and The Belgian Kids’ Fund. The authors YLB and GB acknowledge the funding of the Brufence project (Scalable machine learning for automating defense system) supported by INNOVIRIS (Brussels Institute for the encouragement of scientific research and innovation).

Author information

Authors and Affiliations

Machine Learning Group, Faculty of Science, Université Libre de Bruxelles, Boulevard du Triomphe, CP 212, 1050, Brussels, Belgium
Claudio Reggiani, Yann-Aël Le Borgne & Gianluca Bontempi

Corresponding author

Correspondence to Claudio Reggiani.

Editor information

Editors and Affiliations

Artificial Intelligence, University of Groningen, Groningen, The Netherlands
Bart Verheij
Artificial Intelligence, University of Groningen, Groningen, The Netherlands
Marco Wiering

About this paper

Cite this paper

Reggiani, C., Le Borgne, YA., Bontempi, G. (2018). Feature Selection in High-Dimensional Dataset Using MapReduce. In: Verheij, B., Wiering, M. (eds) Artificial Intelligence. BNAIC 2017. Communications in Computer and Information Science, vol 823. Springer, Cham. https://doi.org/10.1007/978-3-319-76892-2_8

DOI: https://doi.org/10.1007/978-3-319-76892-2_8
Published: 25 February 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-76891-5
Online ISBN: 978-3-319-76892-2
eBook Packages: Computer Science, Computer Science (R0)
