
IDCOS: optimization strategy for parallel complex expression computation on big data

  • Published in: The Journal of Supercomputing

Abstract

Complex expressions are the basis of data analytics. To process complex expressions over big data efficiently, we developed a novel optimization strategy for parallel computation platforms such as Hadoop and Spark. The strategy minimizes the number of data-repartition rounds to achieve high performance. To this end, we modeled the expression as a graph and developed a simplification algorithm for this graph. Based on the graph, we converted the round-minimization problem into a graph decomposition problem and developed a linear-time algorithm to solve it. We also designed an appropriate implementation of the optimization strategy. Extensive experimental results demonstrate that the proposed approach optimizes the computation of complex expressions effectively at small cost.
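The core idea in the abstract — modeling the expression as an operator graph and decomposing it so that repartition (shuffle) rounds are minimized — can be sketched as follows. This is an illustrative reconstruction, not the paper's actual algorithm: the function name `min_repartition_stages`, the operator names, and the greedy stage assignment over a topological order are all assumptions made for the example.

```python
# Hypothetical sketch: model a complex expression as a DAG of operators,
# then assign each operator a stage index equal to the number of
# repartition (shuffle) boundaries on its longest incoming path.
# The maximum stage index is the number of repartition rounds needed.
# Runs in linear time in the size of the graph (one topological pass).
from collections import defaultdict, deque

def min_repartition_stages(ops, edges, needs_shuffle):
    """ops: operator ids; edges: (u, v) means u feeds v;
    needs_shuffle: operators whose input must be repartitioned."""
    succ = defaultdict(list)
    indeg = {op: 0 for op in ops}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    stage = {op: 0 for op in ops}
    queue = deque(op for op in ops if indeg[op] == 0)
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            # crossing into a shuffle operator starts a new round
            stage[v] = max(stage[v], stage[u] + (1 if v in needs_shuffle else 0))
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return stage, max(stage.values())

# Illustrative expression graph: groupby and join require repartition.
ops = ["scan", "filter", "groupby", "map", "join"]
edges = [("scan", "filter"), ("filter", "groupby"),
         ("groupby", "map"), ("map", "join"), ("scan", "join")]
stage, rounds = min_repartition_stages(ops, edges, {"groupby", "join"})
# rounds == 2: one repartition for the groupby, one for the join
```

Operators within one stage can be fused and executed map-side in a single pass; only stage boundaries force a shuffle, which is why minimizing the number of stages directly reduces repartition cost on platforms like Hadoop and Spark.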



Acknowledgements

This paper was supported by NSFC grant U1866602.

Author information


Corresponding author

Correspondence to Hongzhi Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Song, Y., Jin, H., Wang, H. et al. IDCOS: optimization strategy for parallel complex expression computation on big data. J Supercomput 77, 10334–10356 (2021). https://doi.org/10.1007/s11227-021-03674-y
