MiSeRe-Hadoop: A Large-Scale Robust Sequential Classification Rules Mining Framework

Egho, Elias; Gay, Dominique; Trinquart, Romain; Boullé, Marc; Voisine, Nicolas; Clérot, Fabrice

doi:10.1007/978-3-319-64283-3_8

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10440)

Included in the following conference series:

International Conference on Big Data Analytics and Knowledge Discovery

Abstract

Sequence classification has become a fundamental problem in data mining and machine learning. Feature-based classification is one of the most widely used techniques for sequence classification, and mining sequential classification rules plays an important role in it. Despite the abundant literature in this area, mining sequential classification rules remains a challenge: few of the available methods are scalable enough to handle large-scale datasets. MapReduce is a framework well suited to distributed computing over large datasets on clusters of computers. In this paper, we propose a distributed version of the MiSeRe algorithm on MapReduce, called MiSeRe-Hadoop. MiSeRe-Hadoop retains the valuable properties of MiSeRe: (i) it is a robust, user-parameter-free, anytime algorithm, and (ii) it employs an instance-based randomized strategy to promote diversity in mining. We applied our method to two large real-world datasets: a marketing dataset and a text dataset. Our results confirm that the method scales to large-scale sequential data analysis.
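To make the MapReduce idea concrete, the following is a minimal, illustrative sketch of how the class-wise support of candidate sequences could be counted with a map step and a reduce step. It is not the MiSeRe-Hadoop implementation and does not use any Hadoop API: the map and reduce phases are simulated in a single Python process, and every name in it (`is_subsequence`, `map_phase`, `reduce_phase`, `class_supports`, the toy data) is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import chain


def is_subsequence(pattern, sequence):
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so matches must appear in order.
    return all(item in it for item in pattern)


def map_phase(record, candidates):
    """Emit ((candidate, class_label), 1) for each candidate found in the record."""
    sequence, label = record
    for cand in candidates:
        if is_subsequence(cand, sequence):
            yield (cand, label), 1


def reduce_phase(pairs):
    """Sum the emitted counts for each (candidate, class_label) key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def class_supports(partitions, candidates):
    """Simulate one map/reduce round over data split into partitions."""
    mapped = chain.from_iterable(
        map_phase(record, candidates)
        for part in partitions
        for record in part
    )
    table = defaultdict(dict)
    for (cand, label), count in reduce_phase(mapped).items():
        table[cand][label] = count
    return dict(table)


# Toy labeled sequences (assumed data), split into two partitions
# to mimic data distributed across two workers.
partitions = [
    [(("a", "b", "c"), "pos"), (("a", "c"), "neg")],
    [(("b", "a", "c"), "pos"), (("c", "b"), "neg")],
]
candidates = [("a", "c"), ("b", "c")]

supports = class_supports(partitions, candidates)
print(supports)
```

In this toy run, the candidate `("a", "c")` is contained in two `pos` sequences and one `neg` sequence, while `("b", "c")` is contained only in the two `pos` sequences. Per-class counts of this kind are the raw material from which a classification rule's quality can be assessed; how MiSeRe-Hadoop actually scores and selects rules is described in the paper itself, not here.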

Notes

1. This file keeps a copy of all the candidate sequences generated from the job “Generating Candidates” in each iteration.
2. Orange Livebox is an ADSL wireless router available to customers of Orange’s Broadband services in several countries.

Author information

Authors and Affiliations

Orange Labs, 2, avenue Pierre Marzin, 22307, Lannion Cédex, France
Elias Egho, Romain Trinquart, Marc Boullé, Nicolas Voisine & Fabrice Clérot
Université de La Réunion, 2, rue Joseph Wetzell, 97490, Sainte Clotilde, France
Dominique Gay

Corresponding author

Correspondence to Dominique Gay.

Editor information

Editors and Affiliations

LIAS/ISAE-ENSMA, Chasseneuil, France
Ladjel Bellatreche
University of Texas at Arlington, Arlington, Texas, USA
Sharma Chakravarthy

About this paper

Cite this paper

Egho, E., Gay, D., Trinquart, R., Boullé, M., Voisine, N., Clérot, F. (2017). MiSeRe-Hadoop: A Large-Scale Robust Sequential Classification Rules Mining Framework. In: Bellatreche, L., Chakravarthy, S. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2017. Lecture Notes in Computer Science, vol 10440. Springer, Cham. https://doi.org/10.1007/978-3-319-64283-3_8

DOI: https://doi.org/10.1007/978-3-319-64283-3_8
Published: 03 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64282-6
Online ISBN: 978-3-319-64283-3
eBook Packages: Computer Science, Computer Science (R0)
