LDM: Lineage-Aware Data Management in Multi-tier Storage Systems

Conference paper in Advances in Information and Communication (FICC 2019)

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 69)

Abstract

We design and develop LDM, a novel data management solution that caters to the needs of applications exhibiting the lineage property, i.e., applications in which current writes are future reads. In this class of applications, slow writes significantly hurt the overall performance of jobs: current writes determine the fate of the next reads. We believe that in a large-scale shared production cluster, the issues arising from data management can be mitigated at a much higher layer in the I/O-path hierarchy, even before data-access requests are made. Contrary to current data management solutions, which are mostly reactive and/or heuristic, LDM is both deterministic and proactive. We develop block-graphs, which enable LDM to capture complete time-based data-task dependency associations and to use them for life-cycle management through tiering of data blocks. LDM amalgamates information from the entire data center ecosystem, from application code to file system mappings and the compute and storage device topology, to make oracle-like deterministic data management decisions. In trace-driven experiments, LDM achieves a 29–52% reduction in overall data center workload execution time. Moreover, deploying LDM with extensive pre-processing creates efficient data-consumption pipelines, which also reduces write and read delays significantly.
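
To make the block-graph idea concrete, here is a minimal sketch, not the paper's implementation: the names Task, BlockGraph, and promote_imminent_reads are hypothetical, as are the two-tier "ssd"/"hdd" layout and the 60-second promotion window; the rule shown (promote a block when its next scheduled reader is imminent, demote it when no scheduled task reads it) is only one plausible rendering of "current writes are future reads".

    from dataclasses import dataclass, field

    @dataclass
    class Task:
        name: str
        start_time: float                            # scheduled start, seconds from now
        reads: list = field(default_factory=list)    # block ids this task will consume
        writes: list = field(default_factory=list)   # block ids this task produces

    @dataclass
    class BlockGraph:
        """Toy time-based data-task dependency graph (hypothetical sketch)."""
        tasks: list

        def next_reader(self, block_id):
            """Earliest scheduled task that reads block_id, or None."""
            readers = [t for t in self.tasks if block_id in t.reads]
            return min(readers, key=lambda t: t.start_time, default=None)

    def promote_imminent_reads(graph, placement, window=60.0):
        """Promote blocks whose next reader starts within `window` seconds
        to the fast tier; demote blocks that no scheduled task will read."""
        for block_id in list(placement):
            reader = graph.next_reader(block_id)
            if reader is not None and reader.start_time <= window:
                placement[block_id] = "ssd"   # fast tier
            elif reader is None:
                placement[block_id] = "hdd"   # capacity tier
        return placement

    # Task B reads what task A wrote, so A's output is promoted before B runs.
    g = BlockGraph(tasks=[Task("A", 0.0, writes=["blk1"]),
                          Task("B", 30.0, reads=["blk1"])])
    print(promote_imminent_reads(g, {"blk1": "hdd"}))   # {'blk1': 'ssd'}

In this framing, the graph supplies the deterministic "oracle" the abstract describes: because the next reader of every block is known ahead of time, promotion can happen before the read request is ever issued.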


Notes

  1. Storage refers to the overall data plane, whereas a storage node refers to a single physical device.

  2. Storage media across all nodes with similar I/O characteristics form a tier.

  3. Tiering refers to orchestrating data between heterogeneous tiers of storage, leveraging the individual strengths of each to balance cost, performance, and capacity (see the sketch after these notes).

  4. Note that we collected (stored) the traces remotely, on a different machine over the network, rather than on the same local HDFS disk, to preserve the purity of the traces and minimize the effects of the SCSI bus [54,55,56].
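
To make note 3's cost/performance/capacity balance concrete, here is a minimal, hypothetical sketch; the Tier fields, the numbers, and the placement rule are illustrative assumptions, not the paper's mechanism.

    from dataclasses import dataclass

    @dataclass
    class Tier:
        """Illustrative tier description; all numbers are assumptions."""
        name: str
        usd_per_gb: float    # cost
        read_mb_s: float     # performance
        capacity_gb: int     # capacity
        used_gb: int = 0

    def place(block_gb, hot, tiers):
        """Put hot blocks on the fastest tier with room and cold blocks on
        the cheapest tier with room: one simple cost/performance balance."""
        key = (lambda t: -t.read_mb_s) if hot else (lambda t: t.usd_per_gb)
        for tier in sorted(tiers, key=key):
            if tier.used_gb + block_gb <= tier.capacity_gb:
                tier.used_gb += block_gb
                return tier.name
        raise RuntimeError("no tier has capacity for this block")

    tiers = [Tier("ssd", 0.20, 2000.0, 1000), Tier("hdd", 0.03, 150.0, 10000)]
    print(place(128, hot=True, tiers=tiers))    # ssd
    print(place(128, hot=False, tiers=tiers))   # hdd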

References

  1. Balasubramonian, R., Chang, J., Manning, T., Moreno, J.H., Murphy, R., Nair, R., Swanson, S.: Near-data processing: insights from a MICRO-46 workshop. IEEE Micro 34(4), 36–42 (2014)

  2. Tiwari, D., Vazhkudai, S.S., Kim, Y., Ma, X., Boboila, S., Desnoyers, P.J.: Reducing data movement costs using energy-efficient, active computation on SSD. Presented as Part of the 2012 Workshop on Power-Aware Computing and Systems (2012)

  3. Mishra, P., Somani, A.K.: Host managed contention avoidance storage solutions for big data. J. Big Data 4, 18 (2017)

  4. Iliadis, I., Jelitto, J., Kim, Y., Sarafijanovic, S., Venkatesan, V.: Exaplan: queueing-based data placement and provisioning for large tiered storage systems. In: 2015 IEEE 23rd International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 218–227, October 2015

  5. Kim, Y., Gupta, A., Urgaonkar, B., Berman, P., Sivasubramaniam, A.: Hybridstore: a cost-efficient, high-performance storage system combining SSDs and HDDs. In: 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems, pp. 227–236, July 2011

  6. Mishra, P., Mishra, M., Somani, A.K.: Bulk i/o storage management for big data applications. In: IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), pp. 412–417. IEEE (2016)

  7. Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A.D., Katz, R.H., Shenker, S., Stoica, I.: Mesos: a platform for fine-grained resource sharing in the data center. In: NSDI, vol. 11, p. 22 (2011)

  8. Bjørling, M., Axboe, J., Nellans, D., Bonnet, P.: Linux block IO: introducing multi-queue SSD access on multi-core systems. In: Proceedings of the 6th International Systems and Storage Conference, SYSTOR 2013, pp. 22:1–22:10. ACM, New York (2013). http://doi.acm.org/10.1145/2485732.2485740

  9. Bhadkamkar, M., Guerra, J., Useche, L., Burnett, S., Liptak, J., Rangaswami, R., Hristidis, V.: Borg: block-reorganization for self-optimizing storage systems. In: Proceedings of the 7th Conference on File and Storage Technologies, FAST 2009, pp. 183–196. USENIX Association, Berkeley (2009). http://dl.acm.org/citation.cfm?id=1525908.1525922

  10. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M.J., Shenker, S., Stoica, I.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, p. 2. USENIX Association (2012)

  11. Li, H., Ghodsi, A., Zaharia, M., Shenker, S., Stoica, I.: Tachyon: reliable, memory speed storage for cluster computing frameworks. In: Proceedings of the ACM Symposium on Cloud Computing, pp. 1–15. ACM (2014)

  12. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)

  13. Somani, A.K., Mishra, M., Mishra, P.: Applications of hadoop ecosystems tools. In: NoSQL, pp. 173–190. Chapman and Hall/CRC (2017)

  14. Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: Haloop: efficient iterative data processing on large clusters. In: Proceedings of VLDB Endowment, vol. 3, no. 1–2, pp. 285–296, September 2010. http://dx.doi.org/10.14778/1920841.1920881

  15. Ananthanarayanan, G., Ghodsi, A., Wang, A., Borthakur, D., Kandula, S., Shenker, S., Stoica, I.: Pacman: coordinated memory caching for parallel jobs. In: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, p. 20. USENIX Association (2012)

  16. Sun, E.L., Das, S.: Data-dependency-driven flow execution. US Patent App. 15/249,841, 1 March 2018

  17. Aggarwal, A., Arasan, R., Bose, S., Das, D., Kaushik, R.K., Meyer, M.K., Ramasamy, G., Seideman, J.D.: System and method for automatically capturing and recording lineage data for big data records. US Patent App. 14/944,849, 18 May 2017

  18. MacLeod, S.P., Kiernan, C.L., Rajarajan, V.: Data lineage data type. US Patent 6,434,558, 13 August 2002

  19. Peacock, A.J., Couris, C., Storm, C., Netz, A., Cheung, C.Y., Flasko, M.J., Grealish, K., Della-Libera, G.M., Carlson, S.P., Heninger, M.W., et al.: Job scheduling and monitoring. US Patent App. 15/596,709, 5 October 2017

  20. Benjelloun, O., Sarma, A.D., Halevy, A., Widom, J.: ULDBs: databases with uncertainty and lineage. In: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 953–964. VLDB Endowment (2006)

  21. Krajec, R.S., Gounares, A.G.: Highlighting of time series data on force directed graph. US Patent 9,323,863, 26 April 2016

  22. Mohammad, A.S., Boston, M.A., Lapsker, I.: Data lineage transformation analysis. US Patent 9,348,879, 24 May 2016

  23. Woodruff, A., Stonebraker, M.: Supporting fine-grained data lineage in a database visualization environment. In: 1997 Proceedings of 13th International Conference on Data Engineering, pp. 91–102. IEEE (1997)

  24. Bose, R.: A conceptual framework for composing and managing scientific data lineage. In: 2002 Proceedings of 14th International Conference on Scientific and Statistical Database Management, pp. 15–19. IEEE (2002)

  25. Bose, R., Frew, J.: Lineage retrieval for scientific data processing: a survey. ACM Comput. Surv. (CSUR) 37(1), 1–28 (2005)

  26. Widom, J.: Trio: a system for integrated management of data, accuracy, and lineage. Technical report, Stanford InfoLab (2004)

  27. Ré, C., Suciu, D.: Approximate lineage for probabilistic databases. Proc. VLDB Endow. 1(1), 797–808 (2008)

  28. Harter, T., Borthakur, D., Dong, S., Aiyer, A., Tang, L., Arpaci-Dusseau, A.C., Arpaci-Dusseau, R.H.: Analysis of HDFS under HBase: a Facebook messages case study. In: Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST 2014), pp. 199–212. USENIX, Santa Clara (2014). https://www.usenix.org/conference/fast14/technical-sessions/presentation/harter

  29. Choi, D., Jeon, M., Kim, N., Lee, B.-D.: An enhanced data-locality-aware task scheduling algorithm for hadoop applications. IEEE Syst. J. PP, 1–12 (2017)

  30. Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: Proceedings of the 13th International Conference on Extending Database Technology, pp. 99–110. ACM (2010)

  31. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. HotCloud 10, 10–10 (2010)

  32. Islam, N.S., Lu, X., Wasi-ur Rahman, M., Shankar, D., Panda, D.K.: Triple-H: a hybrid approach to accelerate HDFS on HPC clusters with heterogeneous storage architecture. In: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 101–110, May 2015

  33. Krish, K.R., Anwar, A., Butt, A.R.: hatS: a heterogeneity-aware tiered storage for hadoop. In: 2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 502–511, May 2014

  34. Kakoulli, E., Herodotou, H.: Octopusfs: a distributed file system with tiered storage management. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 65–78. ACM (2017)

  35. Grund, M., Krüger, J., Plattner, H., Zeier, A., Cudre-Mauroux, P., Madden, S.: HYRISE: a main memory hybrid storage engine. Proc. VLDB Endow. 4(2), 105–116 (2010)

  36. Islam, N.S., Wasi-ur Rahman, M., Lu, X., Panda, D.K.D.: Efficient data access strategies for hadoop and spark on HPC cluster with heterogeneous storage. In: 2016 IEEE International Conference on Big Data (Big Data), pp. 223–232. IEEE (2016)

  37. Lee, S., Jo, J.-Y., Kim, Y.: Performance improvement of mapreduce process by promoting deep data locality. In: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 292–301. IEEE (2016)

  38. Gunda, P.K., Ravindranath, L., Thekkath, C.A., Yu, Y., Zhuang, L.: Nectar: automatic management of data and computation in datacenters. In: OSDI, vol. 10, pp. 1–8 (2010)

  39. Olson, J.T., Patil, S.R., Shiraguppi, R.M., Spear, G.A.: Handling data block migration to efficiently utilize higher performance tiers in a multi-tier storage environment. US Patent App. 15/499,274, 27 April 2017

  40. Zhang, G., Chiu, L., Liu, L.: Adaptive data migration in multi-tiered storage based cloud environment. In: 2010 IEEE 3rd International Conference on Cloud Computing (CLOUD), pp. 148–155. IEEE (2010)

  41. Mihailescu, M., Soundararajan, G., Amza, C.: Mixapart: decoupled analytics for shared storage systems. In: USENIX FAST (2012)

  42. Krish, K., Wadhwa, B., Iqbal, M.S., Rafique, M.M., Butt, A.R.: On efficient hierarchical storage for big data processing. In: 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 403–408. IEEE (2016)

  43. Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache hadoop yarn: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, SOCC 2013, pp. 5:1–5:16. ACM, New York (2013). http://doi.acm.org/10.1145/2523616.2523633

  44. Lu, W., Wang, Y., Jiang, J., Liu, J., Shen, Y., Wei, B.: Hybrid storage architecture and efficient mapreduce processing for unstructured data. Parallel Comput. 69, 63–77 (2017)

  45. Ananthanarayanan, G., Agarwal, S., Kandula, S., Greenberg, A., Stoica, I., Harlan, D., Harris, E.: Scarlett: coping with skewed content popularity in mapreduce clusters. In: Proceedings of the Sixth Conference on Computer Systems, pp. 287–300. ACM (2011)

  46. Moon, S., Lee, J., Sun, X., Kee, Y.-S.: Optimizing the hadoop mapreduce framework with high-performance storage devices. J. Supercomput. 71(9), 3525–3548 (2015). http://dx.doi.org/10.1007/s11227-015-1447-3

  47. Di Mauro, M., Longo, M., Postiglione, F., Carullo, G., Tambasco, M.: Software defined storage: availability modeling and sensitivity analysis. In: International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS), pp. 1–7. IEEE (2017)

  48. Lee, M., Kang, D.H., Lee, M., Eom, Y.I.: Improving read performance by isolating multiple queues in NVMe SSDs. In: Proceedings of the 11th International Conference on Ubiquitous Information Management and Communication, p. 36. ACM (2017)

  49. Malladi, K.T., Awasthi, M., Zheng, H.: Flexdrive: a framework to explore NVMe storage solutions. In: 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp. 1115–1122. IEEE (2016)

  50. Khasnabish, J.N., Mithani, M.F., Rao, S.: Tier-centric resource allocation in multi-tier cloud systems. IEEE Trans. Cloud Comput. 5(3), 576–589 (2017)

  51. Spivak, A., Razumovskiy, A., Nasonov, D., Boukhanovsky, A., Redice, A.: Storage tier-aware replicative data reorganization with prioritization for efficient workload processing. Future Gen. Comput. Syst. 79, 618–629 (2017)

  52. Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The hibench benchmark suite: characterization of the mapreduce-based data analysis. In: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW), pp. 41–51, March 2010

  53. TPC: TPC Express Benchmark HS (TPCx-HS), Hadoop Suite overview (2016). http://www.tpc.org/tpcx-hs/

  54. Brunelle, A.D.: Blktrace user guide, February 2007

  55. Riska, A., Larkby-Lahet, J., Riedel, E.: Evaluating block-level optimization through the io path. In: 2007 Proceedings of the USENIX Annual Technical Conference, ATC 2007, pp. 19:1–19:14. USENIX Association, Berkeley (2007). http://dl.acm.org/citation.cfm?id=1364385.1364404

  56. Chen, F., Koufaty, D.A., Zhang, X.: Hystor: making the best use of solid state drives in high performance storage systems. In: Proceedings of the International Conference on Supercomputing, pp. 22–32. ACM, New York (2011). http://doi.acm.org/10.1145/1995896.1995902

Author information

Corresponding author

Correspondence to Pratik Mishra.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Mishra, P., Somani, A.K. (2020). LDM: Lineage-Aware Data Management in Multi-tier Storage Systems. In: Arai, K., Bhatia, R. (eds) Advances in Information and Communication. FICC 2019. Lecture Notes in Networks and Systems, vol 69. Springer, Cham. https://doi.org/10.1007/978-3-030-12388-8_48
