A Query Processing Framework for Large-Scale Scientific Data Analysis

Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXVIII

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 11250)

Abstract

Current scientific applications must analyze enormous amounts of array data using complex mathematical data processing methods. This paper describes a distributed query processing framework for large-scale scientific data analysis that captures array-based computations as SQL-like queries, then optimizes and evaluates them using state-of-the-art parallel processing algorithms. Instead of providing a library of concrete distributed algorithms that implement specific matrix operations efficiently, we generalize these algorithms by making them parametric, so that the efficient implementations of the concrete algorithms also apply to their generic counterparts. By specifying matrix operations as generic algebraic operators, we can perform inter-operator optimizations, such as fusing matrix transpose with matrix multiplication, which result in new instantiations of the generic algebraic operators without requiring new efficient algorithms to be introduced on the fly. We report on a prototype implementation of our framework on three Big Data platforms (Hadoop Map-Reduce, Apache Spark, and Apache Flink) using Apache MRQL, a query processing and optimization system for large-scale, distributed data analysis. Finally, we evaluate the effectiveness of our framework through experiments on three queries: a matrix multiplication query, a simple query that combines matrix multiplication with matrix transpose, and a complex iterative query for matrix factorization.
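
The idea of a parametric operator can be illustrated with a small, single-machine sketch. It is not the chapter's implementation or MRQL code: the function `group_by_join`, its parameters, and the representation of a sparse matrix as a list of `(row, column, value)` triples are all assumptions made for illustration. The sketch shows how one generic join-then-aggregate operator can be instantiated once as matrix multiplication and once as multiplication fused with transpose, simply by swapping the index accessors on the second operand.

```python
from collections import defaultdict

def group_by_join(xs, ys, join_x, join_y, group_x, group_y, combine, reduce_fn, zero):
    """Hypothetical generic operator: join xs and ys on a key, then
    group the joined pairs by (group_x, group_y) and aggregate them."""
    # Index the right-hand input by its join key.
    index = defaultdict(list)
    for y in ys:
        index[join_y(y)].append(y)
    # Join, group, and reduce in one pass over the left-hand input.
    out = {}
    for x in xs:
        for y in index.get(join_x(x), ()):
            key = (group_x(x), group_y(y))
            out[key] = reduce_fn(out.get(key, zero), combine(x, y))
    return out

def multiply(a, b):
    """A x B: join A's column with B's row, group by (A row, B column)."""
    return group_by_join(
        a, b,
        join_x=lambda t: t[1], join_y=lambda t: t[0],
        group_x=lambda t: t[0], group_y=lambda t: t[1],
        combine=lambda x, y: x[2] * y[2],
        reduce_fn=lambda s, v: s + v, zero=0)

def multiply_transposed(a, b):
    """A x B^T without materializing B^T: the same operator, with the
    roles of B's row and column indices swapped."""
    return group_by_join(
        a, b,
        join_x=lambda t: t[1], join_y=lambda t: t[1],
        group_x=lambda t: t[0], group_y=lambda t: t[0],
        combine=lambda x, y: x[2] * y[2],
        reduce_fn=lambda s, v: s + v, zero=0)

def transpose(m):
    """Explicit transpose, used only to check the fused version."""
    return [(j, i, v) for (i, j, v) in m]

# A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] as (row, column, value) triples.
A = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
B = [(0, 0, 5), (0, 1, 6), (1, 0, 7), (1, 1, 8)]

product = multiply(A, B)
fused = multiply_transposed(A, B)
unfused = multiply(A, transpose(B))
```

Here `fused` and `unfused` hold the same result, but the fused version never builds the transposed matrix. In a distributed setting the same reasoning would apply to whatever efficient algorithm backs the generic operator, which is the point the abstract makes about reusing implementations across instantiations.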

Acknowledgments

Our performance evaluations were conducted on the Chameleon cloud computing infrastructure (www.chameleoncloud.org), supported by NSF.

Author information

Corresponding author

Correspondence to Leonidas Fegaras.

Copyright information

© 2018 Springer-Verlag GmbH Germany, part of Springer Nature

About this chapter

Cite this chapter

Fegaras, L. (2018). A Query Processing Framework for Large-Scale Scientific Data Analysis. In: Hameurlain, A., Wagner, R., Hartmann, S., Ma, H. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXVIII. Lecture Notes in Computer Science, vol 11250. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-58384-5_5

  • DOI: https://doi.org/10.1007/978-3-662-58384-5_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-58383-8

  • Online ISBN: 978-3-662-58384-5

  • eBook Packages: Computer Science, Computer Science (R0)
