A Query Processing Framework for Large-Scale Scientific Data Analysis

Fegaras, Leonidas

doi:10.1007/978-3-662-58384-5_5

Leonidas Fegaras¹⁷

Part of the book series: Lecture Notes in Computer Science ((TLDKS,volume 11250))

274 Accesses
1 Citations

Abstract

Current scientific applications must analyze enormous amounts of array data using complex mathematical data processing methods. This paper describes a distributed query processing framework for large-scale scientific data analysis that captures array-based computations using SQL-like queries and optimizes and evaluates these computations using state-of-the-art parallel processing algorithms. Instead of providing a library of concrete distributed algorithms that implement certain matrix operations efficiently, we generalize these algorithms by making them parametric in such a way that the same efficient implementations that apply to the concrete algorithms can also apply to their generic counterparts. By specifying matrix operations as generic algebraic operators, we are able to perform inter-operator optimizations, such as fusing matrix transpose with matrix multiplication, resulting to new instantiations of the generic algebraic operators, without having to introduce new efficient algorithms on the fly. We report on a prototype implementation of our framework on three Big Data platforms: Hadoop Map-Reduce, Apache Spark, and Apache Flink, using Apache MRQL, which is a query processing and optimization system for large-scale, distributed data analysis. Finally, we evaluate the effectiveness of our framework through experiments on three queries: a matrix multiplication query, a simple query that combines matrix multiplication with matrix transpose, and a complex iterative query for matrix factorization.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Armbrust, M., et al.: Spark SQL: relational data processing in spark. In: SIGMOD 2015 (2015)
Google Scholar
Apache Flink (2018). http://flink.apache.org/
Apache Hadoop (2018). http://hadoop.apache.org/
Apache Hama (2018). http://hama.apache.org/
Apache Hive (2018). http://hive.apache.org/
Apache Giraph (2018). http://giraph.apache.org/
GraphX: Apache Spark’s API for Graphs and Graph-Parallel Computation (2018). https://spark.apache.org/graphx/
Apache MRQL (incubating) (2018). http://mrql.incubator.apache.org/
Apache Spark (2018). http://spark.apache.org/
Battre, D., Ewen, S., Hueske, F., Kao, O., Markl, V., Warneke, D.: Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In: 1st ACM Symposium on Cloud computing (SOCC 2010), pp. 119–130 (2010)
Google Scholar
Buck, J., et al.: SciHadoop: array-based query processing in hadoop. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2011)
Google Scholar
Chaiken, R., et al.: SCOPE: easy and efficient parallel processing of massive data sets. Proc. VLDB Endow. (PVLDB) 1(2), 1265–1276 (2008)
Article Google Scholar
A. Das, F.N. Afrati, S. Salihoglu, and J.D. Ullman. Upper and lower bounds on the cost of a map-reduce computation. In VLDB 2013 (2013)
Google Scholar
Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. In: OSDI 2004 (2004)
Google Scholar
Fan, J., et al.: The case against specialized graph analytics engines. In: CIDR (2015)
Google Scholar
Fegaras, L.: A query processing framework for array-based computations. In: Hartmann, S., Ma, H. (eds.) DEXA 2016, Part I. LNCS, vol. 9827, pp. 240–254. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44403-1_15
Chapter Google Scholar
Fegaras, L.: An Algebra for Distributed Big Data Analytics. Journal of Functional Programming, Special issue on Programming Languages for Big Data, Volume 27 (2017)
Google Scholar
Fegaras, L., Li, C., Gupta, U.: An optimization framework for map-reduce queries. In: EDBT 2012 (2012)
Google Scholar
Fegaras, L., Li, C., Gupta, U., Philip, J.J.: XML query optimization in map-reduce. In: International Workshop on the Web and Databases (WebDB) (2011)
Google Scholar
Fegaras, L., Maier, D.: Towards an effective calculus for object query languages. In: International Conference on Management of Data (SIGMOD), pp. 47–58 (1995)
Article Google Scholar
Fegaras, L., Maier, D.: Optimizing object queries using an effective calculus. ACM Trans. Database Syst. (TODS) 25(4), 457–516 (2000)
Article Google Scholar
Folk, M., Heber, G., Koziol, Q., Pourmal, E., Robinson, D.: An overview of the HDF5 technology suite and its applications. In: EDBT/ICDT Workshop on Array Databases (2011)
Google Scholar
Gates, A.F., et al.: Building a high-level dataflow system on top of map-reduce: the pig experience. Proc. VLDB Endow. (PVLDB) 2(2), 1414–1425 (2009)
Article Google Scholar
Geijn, R.A., Watts, J.: SUMMA: scalable universal matrix multiplication algorithm. Concurr. Pract. Exp. 9(4), 255–274 (1997)
Article Google Scholar
Geng, Y., Huang, X., Zhu, M., Ruan, H., Yang, G.: SciHive: array-based query processing with HiveQL. In: IEEE International Conference on Trust, Security and Privacy in Computing and Communications (Trustcom) (2013)
Google Scholar
Jindal, A., et al.: Vertexica: your relational friend for graph analytics!. PVLDB 7(13), 1669–1672 (2014)
Google Scholar
Ghoting, A., et al.: SystemML: declarative machine learning on mapreduce. In: IEEE International Conference on Data Engineering (ICDE) (2011)
Google Scholar
Isard, M., Yu, Y.: Distributed data-parallel computing using a high-level programming language. In: ACM SIGMOD International Conference on Management of Data, pp. 987–994 (2009)
Google Scholar
Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. In: IEEE Computer, August 2009
Google Scholar
Kraska, T., Talwalkar, A., Duchi, J., Griffith, R., Franklin, M., Jordan, M.I.: MLbase: a distributed machine learning system. In: Conference on Innovative Data Systems Research (2013)
Google Scholar
Lin, J., Dyer, C.: Data-Intensive Text Processing with MapReduce. Morgan & Claypool Publishers, San Rafael (2010)
Book Google Scholar
Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C., Hellerstein, J.M.: Distributed GraphLab: a framework for machine learning and data mining in the cloud. In: VLDB 2012 (2012)
Article Google Scholar
Malewicz, G., et al.: Pregel: a system for large-scale graph processing. In: ACM SIGMOD International Conference on Management of Data, pp. 135–146 (2010)
Google Scholar
Meng, X., Bradley, J., Yavuz, B., et al.: MLlib: machine learning in apache spark. J. Mach. Learn. Res. 17, 1–7 (2016)
MathSciNet MATH Google Scholar
NetCDF: Network Common Data Form. https://www.unidata.ucar.edu/software/netcdf/
Olston, C., Reed, B., Srivastava, U., Kumar, R., Tomkins, A.: Pig Latin: a not-so-Foreign language for data processing. In: ACM SIGMOD International Conference on Management of Data (2008)
Google Scholar
Papadopoulos, S., Datta, K., Madden, S., Mattson, T.: The TileDB array data storage manager. PVLDB 10(4), 349–360 (2016)
Google Scholar
Soroush, E., Balazinska, M., Wang, D.: ArrayStore: a storage manager for complex parallel array processing. In: ACM SIGMOD International Conference on Management of Data (2011)
Google Scholar
Soroush, E., Balazinska, M., Krughoff, S., Connolly, A.: Efficient iterative processing in the SciDB parallel array engine. In: 27th International Conference on Scientific and Statistical Database Management (SSDBM) (2015)
Google Scholar
Shinnar, A., Cunningham, D., Herta, B., Saraswat, B.: M3R: Increased performance for in-memory Hadoop jobs. In: VLDB 2012 (2012)
Article Google Scholar
The SciDB Development Team. Overview of SciDB: large scale array storage, processing and analysis. In: ACM SIGMOD International Conference on Management of Data (2010)
Google Scholar
Thusoo, A., et al.: Hive: a warehousing solution over a map-reduce framework. Proc. VLDB Endow. (PVLDB) 2(2), 1626–1629 (2009)
Article Google Scholar
Thusoo, A., et al.: Hive: a petabyte scale data warehouse using hadoop. In: IEEE International Conference on Data Engineering (ICDE), pp. 996–1005 (2010)
Google Scholar
Valiant, L.G.: A bridging model for parallel computation. CACM 33(8), 103–111 (1990)
Article Google Scholar
Wang, Y., Jiang, W., Agrawal, G.: SciMATE: a novel MapReduce-like framework for multiple scientific data formats. In: IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (2012)
Google Scholar
Yu, Y., et al.: DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In: Symposium on Operating Systems Design and Implementation (OSDI) (2008)
Google Scholar

Download references

Acknowledgments

Our performance evaluations were performed at the Chameleon cloud computing infrastructure, www.chameleoncloud.org, supported by NSF.

Author information

Authors and Affiliations

University of Texas at Arlington, Arlington, TX, 76019-0015, USA
Leonidas Fegaras

Authors

Leonidas Fegaras
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Leonidas Fegaras .

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
FAW, University of Linz, Linz, Austria
Roland Wagner
Clausthal University of Technology, Clausthal-Zellerfeld, Germany
Sven Hartmann
Victoria University of Wellington, Wellington, New Zealand
Hui Ma

Rights and permissions

Reprints and permissions

Copyright information

About this chapter

Cite this chapter

Fegaras, L. (2018). A Query Processing Framework for Large-Scale Scientific Data Analysis. In: Hameurlain, A., Wagner, R., Hartmann, S., Ma, H. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXVIII. Lecture Notes in Computer Science(), vol 11250. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58384-5_5

Download citation

DOI: https://doi.org/10.1007/978-3-662-58384-5_5
Published: 22 November 2018
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-58383-8
Online ISBN: 978-3-662-58384-5
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics