
IDCOS: optimization strategy for parallel complex expression computation on big data

  • Published in: The Journal of Supercomputing

Abstract

Complex expressions are the basis of data analytics. To process complex expressions over big data efficiently, we developed a novel optimization strategy for parallel computation platforms such as Hadoop and Spark. The strategy minimizes the number of data-repartition rounds to achieve high performance. To this end, we modeled the expression as a graph and developed a simplification algorithm for this graph. Based on the graph, we converted the round-minimization problem into a graph decomposition problem and developed a linear-time algorithm to solve it. We also designed an appropriate implementation of the optimization strategy. Extensive experimental results demonstrate that the proposed approach optimizes the computation of complex expressions effectively at small cost.
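The core idea in the abstract — modeling the expression as an operator graph and decomposing it so that repartition (shuffle) rounds are minimized — can be sketched as follows. This is an illustrative reconstruction, not the paper's actual algorithm: the function name `min_repartition_stages`, the operator names, and the greedy stage assignment over a topological order are all assumptions made for the example.

```python
# Hypothetical sketch: model a complex expression as a DAG of operators,
# then assign each operator a stage index equal to the number of
# repartition (shuffle) boundaries on its longest incoming path.
# The maximum stage index is the number of repartition rounds needed.
# Runs in linear time in the size of the graph (one topological pass).
from collections import defaultdict, deque

def min_repartition_stages(ops, edges, needs_shuffle):
    """ops: operator ids; edges: (u, v) means u feeds v;
    needs_shuffle: operators whose input must be repartitioned."""
    succ = defaultdict(list)
    indeg = {op: 0 for op in ops}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    stage = {op: 0 for op in ops}
    queue = deque(op for op in ops if indeg[op] == 0)
    while queue:
        u = queue.popleft()
        for v in succ[u]:
            # crossing into a shuffle operator starts a new round
            stage[v] = max(stage[v], stage[u] + (1 if v in needs_shuffle else 0))
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return stage, max(stage.values())

# Illustrative expression graph: groupby and join require repartition.
ops = ["scan", "filter", "groupby", "map", "join"]
edges = [("scan", "filter"), ("filter", "groupby"),
         ("groupby", "map"), ("map", "join"), ("scan", "join")]
stage, rounds = min_repartition_stages(ops, edges, {"groupby", "join"})
# rounds == 2: one repartition for the groupby, one for the join
```

Operators within one stage can be fused and executed map-side in a single pass; only stage boundaries force a shuffle, which is why minimizing the number of stages directly reduces repartition cost on platforms like Hadoop and Spark.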



Acknowledgements

This paper was supported by NSFC grant U1866602.

Author information


Corresponding author

Correspondence to Hongzhi Wang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Song, Y., Jin, H., Wang, H. et al. IDCOS: optimization strategy for parallel complex expression computation on big data. J Supercomput 77, 10334–10356 (2021). https://doi.org/10.1007/s11227-021-03674-y
