Abstract
Automated text classification is the assignment of documents to predefined class labels or categories using machine learning algorithms. It is an important domain of machine learning in which an algorithm classifies each document into its appropriate category or genre. For example, the documents might be news items and the categories might be business news, sports news, health news, financial news and social news. Given the volume of such textual data and its presumed exponential growth, classical data mining techniques may not perform efficiently. To this end, the scalable machine learning library Apache Mahout, running on Hadoop, can be used to improve both the performance of the algorithm and its computation time. In this study, the Naïve Bayes classification algorithm is implemented on top of Hadoop to build an automatic document categorizer using the MapReduce programming model. Text documents from the Addis Ababa University institutional repository of electronic theses and dissertations are used as the training and evaluation dataset. The proposed model achieved an accuracy of 79.06%. The result shows that the system can categorize large thesis documents into their predefined classes with promising accuracy.
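As a rough illustration of how Naïve Bayes training fits the MapReduce model, the following minimal sketch simulates the map and reduce phases in a single Python process. It is not the Mahout or Hadoop implementation used in the study. The function names, the whitespace tokenization, the add-one (Laplace) smoothing and the toy training documents are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def map_phase(label, text):
    """Mapper (hypothetical): emit one document record and one record per token."""
    yield ("doc", label), 1
    for token in text.lower().split():  # assumed whitespace tokenization
        yield ("term", label, token), 1


def reduce_phase(pairs):
    """Reducer (hypothetical): sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def train(docs):
    """Build per-class document and term counts from (label, text) pairs."""
    pairs = (pair for label, text in docs for pair in map_phase(label, text))
    totals = reduce_phase(pairs)
    doc_counts = Counter()
    term_counts = defaultdict(Counter)
    for key, n in totals.items():
        if key[0] == "doc":
            doc_counts[key[1]] = n
        else:
            term_counts[key[1]][key[2]] = n
    vocab = {t for counts in term_counts.values() for t in counts}
    return doc_counts, term_counts, vocab


def classify(model, text):
    """Return the class with the highest log posterior under multinomial Naive Bayes."""
    doc_counts, term_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        counts = term_counts[label]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing (assumed)
        score = math.log(doc_counts[label] / n_docs)  # log prior
        for token in text.lower().split():
            if token in vocab:  # tokens unseen in training are ignored
                score += math.log((counts[token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented for illustration only.
training_docs = [
    ("sport", "football match goal team"),
    ("sport", "team wins match"),
    ("business", "market shares profit"),
    ("business", "bank profit market growth"),
]
model = train(training_docs)
print(classify(model, "team goal"))
```

Because the mapper emits independent key-value pairs and the reducer only sums them, the counting step could in principle be distributed across many nodes, which is the property that makes this algorithm a natural fit for Hadoop.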
Notes
- 1.
1 zettabyte = 10^21 bytes = 1000 exabytes = 1 million petabytes = 1 billion terabytes.
- 2.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Temesgen, M.M., Lemma, D.T. (2019). A Scalable Text Classification Using Naive Bayes with Hadoop Framework. In: Mekuria, F., Nigussie, E., Tegegne, T. (eds) Information and Communication Technology for Development for Africa. ICT4DA 2019. Communications in Computer and Information Science, vol 1026. Springer, Cham. https://doi.org/10.1007/978-3-030-26630-1_25
DOI: https://doi.org/10.1007/978-3-030-26630-1_25
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26629-5
Online ISBN: 978-3-030-26630-1
eBook Packages: Computer Science, Computer Science (R0)