Abstract
The rapid growth of unstructured data has become a key driver of enterprise development. Obtaining effective access to massive amounts of unstructured data requires addressing several problems, such as data stored in scattered locations, differences in data access, and non-uniform data formats. In this article, we use Hadoop to build a distributed computing platform that stores unstructured data, and we improve the Hadoop scheduling algorithm on the basis of the end time of slow tasks. The improved algorithm avoids the bulk execution of slow tasks caused by nodes of non-uniform speed when Hadoop runs in a heterogeneous environment, and it improves operating efficiency and stability. Furthermore, we propose a method for constructing a classification index without training sets, in which we improve the term frequency–inverse document frequency (TF-IDF) weight formula by introducing timeliness and entropy. On this basis, we propose a classification algorithm that follows the principle of document similarity and does not use training sets. Finally, we describe how the classification index that is based on the training set is constructed by combining Hadoop and Lucene. As a proof of concept, we implement a prototype system on a Hadoop platform that runs our improved scheduling algorithm, and we conduct experimental studies to demonstrate the feasibility and performance of our approach.
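For orientation, the sketch below shows the textbook TF-IDF weighting and cosine similarity that similarity-based classification of this kind typically starts from. It is a baseline only: the timeliness and entropy terms of the improved weight formula are not given in this abstract and are therefore not modeled. The function names (`tfidf_vectors`, `cosine_similarity`) and the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Textbook TF-IDF only; the paper's timeliness and entropy
    adjustments are NOT included (their form is not stated here).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus: two scheduling documents, one search document.
docs = [
    "hadoop cluster scheduling task".split(),
    "hadoop task scheduling node".split(),
    "lucene index search query".split(),
]
vectors = tfidf_vectors(docs)
sim_close = cosine_similarity(vectors[0], vectors[1])
sim_far = cosine_similarity(vectors[0], vectors[2])
```

In this toy corpus the two scheduling documents share terms and score above zero, while the search document shares none with the first and scores zero. A similarity-based classifier would group documents by such scores instead of fitting a model to labeled training data.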
Acknowledgements
The authors are supported by the Science and Technology Research Project of Chongqing Education Committee of China (KJ1602203) and the Scientific Research Programs in Higher Education of Chongqing Institute of Higher Education (CQGJ15203B).
Cite this article
Cai, T., Yang, X. Non-structured Data Integration Access Policy Using Hadoop. Wireless Pers Commun 102, 895–908 (2018). https://doi.org/10.1007/s11277-017-5112-4
DOI: https://doi.org/10.1007/s11277-017-5112-4