Abstract
In data analytics, researchers often work on the same datasets while investigating different aspects, and they typically develop their programs incrementally. This opens opportunities to share and recycle results from previously executed jobs if they contain identical operations, e.g., restructuring, filtering, and other kinds of data preparation.
In this paper, we present an approach to accelerate the processing of such dataflow programs by materializing and recycling (intermediate) results in Apache Spark. We have implemented this idea in Piglet, our Pig Latin compiler for Spark, which transparently supports both merging multiple jobs and rewriting jobs to reuse intermediate results. We discuss the opportunities for recycling and present a profiling-based cost model as well as a decision model to identify potentially beneficial materialization points. Finally, we report the results of our experimental evaluation, which show the validity of the cost model and the benefit of recycling.
S. Hagedorn—This work was partially funded by the German Research Foundation (DFG) under grant no. SA782/22.
Notes
- 3. In reality, the model contains only one edge between the nodes. The multiple edges merely illustrate the different jobs.
- 4. This requires, of course, that the clocks on all nodes are synchronized, for example via NTP.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Hagedorn, S., Sattler, K.-U. (2018). Cost-Based Sharing and Recycling of (Intermediate) Results in Dataflow Programs. In: Benczúr, A., Thalheim, B., Horváth, T. (eds) Advances in Databases and Information Systems. ADBIS 2018. Lecture Notes in Computer Science, vol 11019. Springer, Cham. https://doi.org/10.1007/978-3-319-98398-1_13
DOI: https://doi.org/10.1007/978-3-319-98398-1_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98397-4
Online ISBN: 978-3-319-98398-1
eBook Packages: Computer Science, Computer Science (R0)