Cost-Based Sharing and Recycling of (Intermediate) Results in Dataflow Programs

Hagedorn, Stefan; Sattler, Kai-Uwe

doi:10.1007/978-3-319-98398-1_13


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11019)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems


Abstract

In data analytics, researchers often work on the same datasets, investigating different aspects, and moreover develop their programs incrementally. This opens opportunities to share and recycle results from previously executed jobs if they contain identical operations, e.g., restructuring, filtering, and other kinds of data preparation.

In this paper, we present an approach to accelerate the processing of such dataflow programs by materializing and recycling (intermediate) results in Apache Spark. We have implemented this idea in Piglet, our Pig Latin compiler for Spark, which transparently supports both merging multiple jobs and rewriting jobs to reuse intermediate results. We discuss the opportunities for recycling and present a profiling-based cost model as well as a decision model to identify potentially beneficial materialization points. Finally, we report the results of our experimental evaluation, which show the validity of the cost model and the benefit of recycling.
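The general idea of recycling can be illustrated with a minimal Python sketch. It is not Piglet's implementation and does not reproduce the paper's cost or decision model. The lineage signature, the `Recycler` class, the `worth_materializing` rule, the file name `trips.csv`, and all timing values are hypothetical and chosen only for illustration. The sketch assumes that two operators are identical when their operator name, parameters, and input signatures match, and that a result is worth materializing when the time saved over the expected reuses exceeds the one-time cost of writing it.

```python
import hashlib

def signature(op, params, input_sigs):
    """Hypothetical lineage signature: identical operator, parameters,
    and inputs always yield the same key."""
    text = op + "|" + repr(params) + "|" + ",".join(input_sigs)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def worth_materializing(compute_secs, write_secs, read_secs, expected_reuses):
    """Toy decision rule (an assumption, not the paper's model):
    materialize if the time saved across reuses exceeds the write cost."""
    saved_per_reuse = compute_secs - read_secs
    return expected_reuses * saved_per_reuse > write_secs

class Recycler:
    """In-memory stand-in for a store of materialized results."""

    def __init__(self):
        self.store = {}  # signature -> materialized result
        self.hits = 0

    def run(self, sig, compute, profile):
        # Reuse a previously materialized result if one exists.
        if sig in self.store:
            self.hits += 1
            return self.store[sig]
        result = compute()
        # Materialize only if the profiled costs suggest a benefit.
        if worth_materializing(**profile):
            self.store[sig] = result
        return result

# Two jobs that share the same LOAD + FILTER prefix.
load_sig = signature("LOAD", "trips.csv", [])
filter_sig = signature("FILTER", "distance > 10", [load_sig])

recycler = Recycler()
data = [3, 12, 25, 7]
profile = dict(compute_secs=40.0, write_secs=15.0, read_secs=5.0, expected_reuses=1)

job1 = recycler.run(filter_sig, lambda: [d for d in data if d > 10], profile)
job2 = recycler.run(filter_sig, lambda: [d for d in data if d > 10], profile)
```

In this sketch the second job finds the filter result under the same signature and skips the computation. A real system would write results to persistent storage and derive the profile values from measured runtimes rather than fixed numbers.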

S. Hagedorn: This work was partially funded by the German Research Foundation (DFG) under grant no. SA782/22.

Notes

1. https://spark.apache.org/.
2. https://github.com/dbis-ilm/piglet.
3. In the actual model, there is only one edge between any two nodes; the multiple edges merely illustrate the different jobs.
4. Of course, this requires that the clocks on all nodes are synchronized, for example via NTP.
5. http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.
6. http://www1.nyc.gov/site/planning/data-maps/open-data.page.
7. https://www.gdeltproject.org/data.html.


Author information

Authors and Affiliations

Databases and Information Systems Group, TU Ilmenau, Ilmenau, Germany
Stefan Hagedorn & Kai-Uwe Sattler


Corresponding author

Correspondence to Stefan Hagedorn.

Editor information

Editors and Affiliations

Eötvös Loránd University, Budapest, Hungary
András Benczúr
Christian-Albrechts-Universität, Kiel, Germany
Bernhard Thalheim
Eötvös Loránd University, Budapest, Hungary
Tomáš Horváth


Cite this paper

Hagedorn, S., Sattler, K.-U. (2018). Cost-Based Sharing and Recycling of (Intermediate) Results in Dataflow Programs. In: Benczúr, A., Thalheim, B., Horváth, T. (eds) Advances in Databases and Information Systems. ADBIS 2018. Lecture Notes in Computer Science, vol 11019. Springer, Cham. https://doi.org/10.1007/978-3-319-98398-1_13


DOI: https://doi.org/10.1007/978-3-319-98398-1_13
Published: 29 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98397-4
Online ISBN: 978-3-319-98398-1
eBook Packages: Computer Science, Computer Science (R0)
