Abstract
Hierarchical data stream classification inherits the properties and constraints of both hierarchical classification and data stream classification. It therefore requires novel approaches that (i) can handle class hierarchies, (ii) can be updated over time, and (iii) are computationally lightweight in processing time and memory usage. In this study, we propose the Gaussian Naive Bayes for Hierarchical Data Streams (GNB-hDS) method: an incremental Gaussian Naive Bayes for classifying potentially unbounded hierarchical data streams. GNB-hDS uses statistical summaries of the data stream instead of storing actual instances. These summaries allow more efficient data storage, keep computational time and memory constant, and are used to calculate the probability of an instance belonging to a specific class via Bayes' theorem. We compare our method against a technique that stores raw instances, and the results show that our method obtains equivalent prediction rates while being significantly faster.
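The idea of replacing stored instances with statistical summaries can be illustrated with a minimal Python sketch. It is not the GNB-hDS implementation: it assumes a running mean and variance per class and feature (Welford-style updates), treats each label as a flat identifier such as "R/A", and omits any hierarchy-specific logic. The names `RunningGaussian`, `IncrementalGaussianNB`, `learn_one`, and `predict_one`, as well as the variance floor `eps`, are hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


class RunningGaussian:
    """Count, mean, and sum of squared deviations, updated one value at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        # Welford-style incremental update: no past values are stored.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self, eps=1e-9):
        # Sample variance; eps is an assumed floor that avoids division by zero.
        return self.m2 / (self.n - 1) + eps if self.n > 1 else eps

    def log_pdf(self, x):
        # Log of the Gaussian density with the current mean and variance.
        var = self.variance()
        return -0.5 * (math.log(2 * math.pi * var) + (x - self.mean) ** 2 / var)


class IncrementalGaussianNB:
    """Gaussian Naive Bayes whose memory depends on classes and features only."""

    def __init__(self):
        self.total = 0
        self.class_counts = defaultdict(int)
        self.stats = defaultdict(dict)  # label -> {feature index: RunningGaussian}

    def learn_one(self, x, label):
        # Update the summaries with one labelled instance, then discard it.
        self.total += 1
        self.class_counts[label] += 1
        for i, value in enumerate(x):
            self.stats[label].setdefault(i, RunningGaussian()).update(value)

    def predict_one(self, x):
        # Bayes' theorem in log space: log prior + sum of log likelihoods.
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / self.total)
            for i, value in enumerate(x):
                score += self.stats[label][i].log_pdf(value)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = IncrementalGaussianNB()
stream = [
    ([1.0, 2.0], "R/A"), ([5.0, 6.0], "R/B"),
    ([1.2, 1.8], "R/A"), ([5.2, 5.8], "R/B"),
    ([0.8, 2.2], "R/A"), ([4.8, 6.2], "R/B"),
]
for features, label in stream:
    model.learn_one(features, label)

print(model.predict_one([1.1, 2.0]))
```

In this sketch, each update touches one summary per feature, so the cost per instance does not grow with the length of the stream, which is the property the abstract attributes to statistical summaries.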
Supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Tieppo, E., Barddal, J.P., Nievola, J.C. (2021). Classifying Potentially Unbounded Hierarchical Data Streams with Incremental Gaussian Naive Bayes. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science, vol 13073. Springer, Cham. https://doi.org/10.1007/978-3-030-91702-9_28
DOI: https://doi.org/10.1007/978-3-030-91702-9_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91701-2
Online ISBN: 978-3-030-91702-9
eBook Packages: Computer Science, Computer Science (R0)