
Machine Learning in Data Lake for Combining Data Silos

  • Conference paper
Data Mining and Big Data (DMBD 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10387)

Abstract

Over the years, a data silo can grow into a large-scale data store that overlaps with other silos and is of uncertain quality. Such a silo allows an organization to develop its own analytical capabilities. A data lake can address this problem efficiently through data analysis using statistical and predictive modeling techniques, which can be applied to enhance and support an organization’s business strategy. This study provides an overview of the decision-making process, operational efficiency, and the creation of a solution for an organization. Machine learning can distribute the data model architecture and integrate a data silo with other organizations’ data, optimizing operational business processes within an organization and improving data quality and efficiency. The approach is tested on Air Pollutant Index data from the Malaysian and Singaporean Government Open Data to determine air pollution levels relevant to the health and safety of the population.
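As a rough illustration of what combining silos can involve, the sketch below maps two differently shaped exports onto one common schema, drops overlapping records, and labels each reading with a band. It is not the authors' method: the field names, sample values, country codes, and band thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """Common schema that every silo is mapped onto (an assumption)."""
    country: str
    area: str
    date: str
    index: int


# Hypothetical silo exports; field names and values are invented.
silo_a = [
    {"area": "Area 1", "date": "2016-01-01", "api": 45},
    {"area": "Area 2", "date": "2016-01-01", "api": 120},
]
silo_b = [
    {"region": "Region X", "day": "2016-01-01", "reading": "75"},
    {"region": "Region X", "day": "2016-01-01", "reading": "75"},  # overlap
]


def from_silo_a(row):
    """Map a silo A record onto the common schema."""
    return Reading("MY", row["area"], row["date"], int(row["api"]))


def from_silo_b(row):
    """Map a silo B record onto the common schema (index stored as text)."""
    return Reading("SG", row["region"], row["day"], int(row["reading"]))


def combine(*batches):
    """Merge mapped batches, dropping exact duplicates but keeping order."""
    seen, merged = set(), []
    for batch in batches:
        for reading in batch:
            if reading not in seen:
                seen.add(reading)
                merged.append(reading)
    return merged


def band(index):
    """Assumed illustrative bands; check the official scale before use."""
    if index <= 50:
        return "good"
    if index <= 100:
        return "moderate"
    if index <= 200:
        return "unhealthy"
    return "very unhealthy or worse"


combined = combine(map(from_silo_a, silo_a), map(from_silo_b, silo_b))
for r in combined:
    print(r.country, r.area, r.date, r.index, band(r.index))
```

Keeping one mapping function per silo means a new source only needs its own adapter, while deduplication and labeling stay shared.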


Acknowledgements

This work is supported by the Ministry of Higher Education Malaysia (MOHE), the Ministry of Science, Technology and Innovation Malaysia (MOSTI), and Universiti Teknologi Malaysia (UTM). This paper is financially supported by E-Science Fund, R.J130000.7928.4S117, PRGS Grant, R.J130000.7828.4L680, GUP Tier 1 UTM, Q.J130000.2528.13H48, FRGS Grant, R.J130000.7828.4F634 and IDG Grant, R.J130000.7728.4J170. The authors would like to express their deepest gratitude to the Research Management Centre (RMC), UTM, for its support in research and development, and to the Soft Computing Research Group (SCRG) for the inspiration that made this research a success. The authors would also like to thank the anonymous reviewers, who have contributed enormously to this work.

Author information

Corresponding author

Correspondence to Sarina Sulaiman.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Wibowo, M., Sulaiman, S., Shamsuddin, S.M. (2017). Machine Learning in Data Lake for Combining Data Silos. In: Tan, Y., Takagi, H., Shi, Y. (eds) Data Mining and Big Data. DMBD 2017. Lecture Notes in Computer Science, vol 10387. Springer, Cham. https://doi.org/10.1007/978-3-319-61845-6_30

  • DOI: https://doi.org/10.1007/978-3-319-61845-6_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-61844-9

  • Online ISBN: 978-3-319-61845-6

  • eBook Packages: Computer Science, Computer Science (R0)
