LDM: Lineage-Aware Data Management in Multi-tier Storage Systems

Conference paper in Advances in Information and Communication (FICC 2019)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 69)

Abstract

We design and develop LDM, a novel data management solution that caters to the needs of applications exhibiting the lineage property, i.e., applications in which current writes are future reads. In this class of applications, slow writes significantly hurt the overall performance of jobs: current writes determine the fate of the next reads. We believe that in a large-scale shared production cluster, the issues arising from data management can be mitigated at a much higher layer in the I/O-path hierarchy, even before data-access requests are made. Contrary to current data management solutions, which are mostly reactive and/or heuristic, LDM is both deterministic and proactive. We develop block-graphs, which enable LDM to capture complete time-based data-task dependency associations and to use them for life-cycle management through tiering of data blocks. LDM amalgamates information from the entire data center ecosystem, from application code to file system mappings and the compute and storage device topology, to make oracle-like deterministic data management decisions. In trace-driven experiments, LDM achieves a 29–52% reduction in overall data center workload execution time. Moreover, deploying LDM with extensive pre-processing creates efficient data-consumption pipelines, which also reduces write and read delays significantly.
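
To make the block-graph idea concrete, here is a minimal sketch, not the paper's implementation: the names Task, BlockGraph, and promote_imminent_reads are hypothetical, as are the two-tier "ssd"/"hdd" layout and the 60-second promotion window; the rule shown (promote a block when its next scheduled reader is imminent, demote it when no scheduled task reads it) is only one plausible rendering of "current writes are future reads".

    from dataclasses import dataclass, field

    @dataclass
    class Task:
        name: str
        start_time: float                            # scheduled start, seconds from now
        reads: list = field(default_factory=list)    # block ids this task will consume
        writes: list = field(default_factory=list)   # block ids this task produces

    @dataclass
    class BlockGraph:
        """Toy time-based data-task dependency graph (hypothetical sketch)."""
        tasks: list

        def next_reader(self, block_id):
            """Earliest scheduled task that reads block_id, or None."""
            readers = [t for t in self.tasks if block_id in t.reads]
            return min(readers, key=lambda t: t.start_time, default=None)

    def promote_imminent_reads(graph, placement, window=60.0):
        """Promote blocks whose next reader starts within `window` seconds
        to the fast tier; demote blocks that no scheduled task will read."""
        for block_id in list(placement):
            reader = graph.next_reader(block_id)
            if reader is not None and reader.start_time <= window:
                placement[block_id] = "ssd"   # fast tier
            elif reader is None:
                placement[block_id] = "hdd"   # capacity tier
        return placement

    # Task B reads what task A wrote, so A's output is promoted before B runs.
    g = BlockGraph(tasks=[Task("A", 0.0, writes=["blk1"]),
                          Task("B", 30.0, reads=["blk1"])])
    print(promote_imminent_reads(g, {"blk1": "hdd"}))   # {'blk1': 'ssd'}

In this framing, the graph supplies the deterministic "oracle" the abstract describes: because the next reader of every block is known ahead of time, promotion can happen before the read request is ever issued.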


Notes

  1. Storage refers to the overall data plane, whereas a storage node refers to a single physical device.

  2. Storage media across all nodes with similar I/O characteristics form a tier.

  3. Tiering refers to orchestrating data between heterogeneous tiers of storage, leveraging the individual strengths of each to balance cost, performance, and capacity (see the sketch after these notes).

  4. Note that we collected (stored) the traces remotely, on a different machine over the network, rather than on the same local HDFS disk, to preserve the purity of the traces and minimize the effects of the SCSI bus [54,55,56].
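
To make note 3's cost/performance/capacity balance concrete, here is a minimal, hypothetical sketch; the Tier fields, the numbers, and the placement rule are illustrative assumptions, not the paper's mechanism.

    from dataclasses import dataclass

    @dataclass
    class Tier:
        """Illustrative tier description; all numbers are assumptions."""
        name: str
        usd_per_gb: float    # cost
        read_mb_s: float     # performance
        capacity_gb: int     # capacity
        used_gb: int = 0

    def place(block_gb, hot, tiers):
        """Put hot blocks on the fastest tier with room and cold blocks on
        the cheapest tier with room: one simple cost/performance balance."""
        key = (lambda t: -t.read_mb_s) if hot else (lambda t: t.usd_per_gb)
        for tier in sorted(tiers, key=key):
            if tier.used_gb + block_gb <= tier.capacity_gb:
                tier.used_gb += block_gb
                return tier.name
        raise RuntimeError("no tier has capacity for this block")

    tiers = [Tier("ssd", 0.20, 2000.0, 1000), Tier("hdd", 0.03, 150.0, 10000)]
    print(place(128, hot=True, tiers=tiers))    # ssd
    print(place(128, hot=False, tiers=tiers))   # hdd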

References

  1. Balasubramonian, R., Chang, J., Manning, T., Moreno, J.H., Murphy, R., Nair, R., Swanson, S.: Near-data processing: insights from a MICRO-46 workshop. IEEE Micro 34(4), 36–42 (2014)

  2. Tiwari, D., Vazhkudai, S.S., Kim, Y., Ma, X., Boboila, S., Desnoyers, P.J.: Reducing data movement costs using energy-efficient, active computation on SSD. Presented as Part of the 2012 Workshop on Power-Aware Computing and Systems (2012)

  3. Mishra, P., Somani, A.K.: Host managed contention avoidance storage solutions for big data. J. Big Data 4, 18 (2017)

  4. Iliadis, I., Jelitto, J., Kim, Y., Sarafijanovic, S., Venkatesan, V.: Exaplan: queueing-based data placement and provisioning for large tiered storage systems. In: 2015 IEEE 23rd International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 218–227, October 2015

  5. Kim, Y., Gupta, A., Urgaonkar, B., Berman, P., Sivasubramaniam, A.: Hybridstore: a cost-efficient, high-performance storage system combining SSDs and HDDs. In: 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 227–236, July 2011

  6. Mishra, P., Mishra, M., Somani, A.K.: Bulk i/o storage management for big data applications. In: IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 412–417. IEEE (2016)

  7. Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A.D., Katz, R.H., Shenker, S., Stoica, I.: Mesos: a platform for fine-grained resource sharing in the data center. In: NSDI, vol. 11, p. 22 (2011)

  8. Bjørling, M., Axboe, J., Nellans, D., Bonnet, P.: Linux block IO: introducing multi-queue SSD access on multi-core systems. In: Proceedings of the 6th International Systems and Storage Conference, SYSTOR 2013, pp. 22:1–22:10. ACM, New York (2013). http://doi.acm.org/10.1145/2485732.2485740

  9. Bhadkamkar, M., Guerra, J., Useche, L., Burnett, S., Liptak, J., Rangaswami, R., Hristidis, V.: Borg: block-reorganization for self-optimizing storage systems. In: Proceedings of the 7th Conference on File and Storage Technologies, FAST 2009, pp. 183–196. USENIX Association, Berkeley (2009). http://dl.acm.org/citation.cfm?id=1525908.1525922

  10. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, p. 2. USENIX Association (2012)

  11. Li, H., Ghodsi, A., Zaharia, M., Shenker, S., Stoica, I.: Tachyon: reliable, memory speed storage for cluster computing frameworks. In: Proceedings of the ACM Symposium on Cloud Computing, pp. 1–15. ACM (2014)

  12. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)

  13. Somani, A.K., Mishra, M., Mishra, P.: Applications of hadoop ecosystems tools. In: NoSQL, pp. 173–190. Chapman and Hall/CRC (2017)

  14. Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: Haloop: efficient iterative data processing on large clusters. In: Proceedings of VLDB Endowment, vol. 3, no. 1–2, pp. 285–296, September 2010. http://dx.doi.org/10.14778/1920841.1920881

  15. Ananthanarayanan, G., Ghodsi, A., Wang, A., Borthakur, D., Kandula, S., Shenker, S., Stoica, I.: Pacman: coordinated memory caching for parallel jobs. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, p. 20. USENIX Association (2012)

  16. Sun, E.L., Das, S.: Data-dependency-driven flow execution. US Patent App. 15/249,841, 1 March 2018

  17. Aggarwal, A., Arasan, R., Bose, S., Das, D., Kaushik, R.K., Meyer, M.K., Ramasamy, G., Seideman, J.D.: System and method for automatically capturing and recording lineage data for big data records. US Patent App. 14/944,849, 18 May 2017

  18. MacLeod, S.P., Kiernan, C.L., Rajarajan, V.: Data lineage data type. US Patent 6,434,558, 13 August 2002

  19. Peacock, A.J., Couris, C., Storm, C., Netz, A., Cheung, C.Y., Flasko, M.J., Grealish, K., Della-Libera, G.M., Carlson, S.P., Heninger, M.W., et al.: Job scheduling and monitoring. US Patent App. 15/596,709, 5 October 2017

  20. Benjelloun, O., Sarma, A.D., Halevy, A., Widom, J.: ULDBs: databases with uncertainty and lineage. In: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 953–964. VLDB Endowment (2006)

  21. Krajec, R.S., Gounares, A.G.: Highlighting of time series data on force directed graph. US Patent 9,323,863, 26 April 2016

  22. Mohammad, A.S., Boston, M.A., Lapsker, I.: Data lineage transformation analysis. US Patent 9,348,879, 24 May 2016

  23. Woodruff, A., Stonebraker, M.: Supporting fine-grained data lineage in a database visualization environment. In: 1997 Proceedings of 13th International Conference on Data Engineering, pp. 91–102. IEEE (1997)

  24. Bose, R.: A conceptual framework for composing and managing scientific data lineage. In: 2002 Proceedings of 14th International Conference on Scientific and Statistical Database Management, pp. 15–19. IEEE (2002)

  25. Bose, R., Frew, J.: Lineage retrieval for scientific data processing: a survey. ACM Comput. Surv. (CSUR) 37(1), 1–28 (2005)

  26. Widom, J.: Trio: a system for integrated management of data, accuracy, and lineage. Technical report, Stanford InfoLab (2004)

  27. Ré, C., Suciu, D.: Approximate lineage for probabilistic databases. Proc. VLDB Endow. 1(1), 797–808 (2008)

  28. Harter, T., Borthakur, D., Dong, S., Aiyer, A., Tang, L., Arpaci-Dusseau, A.C., Arpaci-Dusseau, R.H.: Analysis of HDFS under HBase: a Facebook messages case study. In: Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST 2014), pp. 199–212. USENIX, Santa Clara (2014). https://www.usenix.org/conference/fast14/technical-sessions/presentation/harter

  29. Choi, D., Jeon, M., Kim, N., Lee, B.-D.: An enhanced data-locality-aware task scheduling algorithm for hadoop applications. IEEE Syst. J. PP, 1–12 (2017)

  30. Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proceedings of the 13th International Conference on Extending Database Technology, pp. 99–110. ACM (2010)

  31. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud 10, 10–10 (2010)

  32. Islam, N.S., Lu, X., Wasi-ur Rahman, M., Shankar, D., Panda, D.K.: Triple-H: a hybrid approach to accelerate HDFS on HPC clusters with heterogeneous storage architecture. In: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 101–110, May 2015

  33. Krish, K.R., Anwar, A., Butt, A.R.: hatS: a heterogeneity-aware tiered storage for hadoop. In: 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 502–511, May 2014

  34. Kakoulli, E., Herodotou, H.: Octopusfs: a distributed file system with tiered storage management. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 65–78. ACM (2017)

  35. Grund, M., Krüger, J., Plattner, H., Zeier, A., Cudre-Mauroux, P., Madden, S.: HYRISE: a main memory hybrid storage engine. Proc. VLDB Endow. 4(2), 105–116 (2010)

  36. Islam, N.S., Wasi-ur Rahman, M., Lu, X., Panda, D.K.D.: Efficient data access strategies for hadoop and spark on HPC cluster with heterogeneous storage. In: 2016 IEEE International Conference on Big Data (Big Data), pp. 223–232. IEEE (2016)

  37. Lee, S., Jo, J.-Y., Kim, Y.: Performance improvement of mapreduce process by promoting deep data locality. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 292–301. IEEE (2016)

  38. Gunda, P.K., Ravindranath, L., Thekkath, C.A., Yu, Y., Zhuang, L.: Nectar: automatic management of data and computation in datacenters. In: OSDI, vol. 10, pp. 1–8 (2010)

  39. Olson, J.T., Patil, S.R., Shiraguppi, R.M., Spear, G.A.: Handling data block migration to efficiently utilize higher performance tiers in a multi-tier storage environment. US Patent App. 15/499,274, 27 April 2017

  40. Zhang, G., Chiu, L., Liu, L.: Adaptive data migration in multi-tiered storage based cloud environment. In: 2010 IEEE 3rd International Conference on Cloud Computing (CLOUD), pp. 148–155. IEEE (2010)

  41. Mihailescu, M., Soundararajan, G., Amza, C.: Mixapart: decoupled analytics for shared storage systems. In: USENIX FAST (2012)

  42. Krish, K., Wadhwa, B., Iqbal, M.S., Rafique, M.M., Butt, A.R.: On efficient hierarchical storage for big data processing. In: 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 403–408. IEEE (2016)

  43. Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache hadoop yarn: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC 2013, pp. 5:1–5:16. ACM, New York (2013). http://doi.acm.org/10.1145/2523616.2523633

  44. Lu, W., Wang, Y., Jiang, J., Liu, J., Shen, Y., Wei, B.: Hybrid storage architecture and efficient mapreduce processing for unstructured data. Parallel Comput. 69, 63–77 (2017)

  45. Ananthanarayanan, G., Agarwal, S., Kandula, S., Greenberg, A., Stoica, I., Harlan, D., Harris, E.: Scarlett: coping with skewed content popularity in mapreduce clusters. In: Proceedings of the Sixth Conference on Computer Systems, pp. 287–300. ACM (2011)

  46. Moon, S., Lee, J., Sun, X., Kee, Y.-S.: Optimizing the hadoop mapreduce framework with high-performance storage devices. J. Supercomput. 71(9), 3525–3548 (2015). http://dx.doi.org/10.1007/s11227-015-1447-3

  47. Di Mauro, M., Longo, M., Postiglione, F., Carullo, G., Tambasco, M.: Software defined storage: availability modeling and sensitivity analysis. In: International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS), pp. 1–7. IEEE (2017)

  48. Lee, M., Kang, D.H., Lee, M., Eom, Y.I.: Improving read performance by isolating multiple queues in NVMe SSDs. In: Proceedings of the 11th International Conference on Ubiquitous Information Management and Communication, p. 36. ACM (2017)

  49. Malladi, K.T., Awasthi, M., Zheng, H.: Flexdrive: a framework to explore NVMe storage solutions. In: 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 1115–1122. IEEE (2016)

  50. Khasnabish, J.N., Mithani, M.F., Rao, S.: Tier-centric resource allocation in multi-tier cloud systems. IEEE Trans. Cloud Comput. 5(3), 576–589 (2017)

  51. Spivak, A., Razumovskiy, A., Nasonov, D., Boukhanovsky, A., Redice, A.: Storage tier-aware replicative data reorganization with prioritization for efficient workload processing. Future Gen. Comput. Syst. 79, 618–629 (2017)

  52. Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The hibench benchmark suite: characterization of the mapreduce-based data analysis. In: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW), pp. 41–51, March 2010

  53. TPC: TPC Express Benchmark HS (TPCx-HS), Hadoop Suite overview (2016). http://www.tpc.org/tpcx-hs/

  54. Brunelle, A.D.: Blktrace user guide, February 2007

  55. Riska, A., Larkby-Lahet, J., Riedel, E.: Evaluating block-level optimization through the io path. In: 2007 Proceedings of the USENIX Annual Technical Conference, ATC 2007, pp. 19:1–19:14. USENIX Association, Berkeley (2007). http://dl.acm.org/citation.cfm?id=1364385.1364404

  56. Chen, F., Koufaty, D.A., Zhang, X.: Hystor: making the best use of solid state drives in high performance storage systems. In: Proceedings of the International Conference on Supercomputing, pp. 22–32. ACM, New York (2011). http://doi.acm.org/10.1145/1995896.1995902

Author information

Corresponding author

Correspondence to Pratik Mishra.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Mishra, P., Somani, A.K. (2020). LDM: Lineage-Aware Data Management in Multi-tier Storage Systems. In: Arai, K., Bhatia, R. (eds) Advances in Information and Communication. FICC 2019. Lecture Notes in Networks and Systems, vol 69. Springer, Cham. https://doi.org/10.1007/978-3-030-12388-8_48
