
Enhancing OpenMP Tasking Model: Performance and Portability

  • Conference paper
  • Published in: OpenMP: Enabling Massive Node-Level Parallelism (IWOMP 2021)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 12870)

Abstract

OpenMP, as the de facto standard programming model for symmetric multiprocessing in HPC, has seen its performance continuously improved by the community, through both implementation enhancements and specification augmentations. Furthermore, the language has evolved from a prescriptive nature, as defined by the thread-centric model, to a descriptive behavior, as defined by the task-centric model. However, the overhead related to the orchestration of tasks is still relatively high, so applications exploiting very fine-grained parallelism on systems with a large number of cores may fail to scale.

In this work, we propose to include the concept of the Task Dependency Graph (TDG) in the specification by introducing a new clause, named taskgraph, attached to the task or target directives. By design, the TDG alleviates the overhead associated with the OpenMP tasking model, and it also facilitates linking OpenMP with other programming models that support task parallelism. According to our experiments, a GCC implementation of the taskgraph clause significantly reduces the execution time of fine-grained task applications and improves their scalability with respect to the number of threads.


Notes

  1. The execution ran on a node of the Marenostrum IV [1] supercomputer, equipped with two Intel Xeon Platinum 8160 CPUs (2 sockets of 24 physical cores each) running at 2.1 GHz with 33 MB of L3 cache.

References

  1. BSC: Marenostrum IV User’s Guide (2017). https://www.bsc.es/support/MareNostrum4-ug.pdf

  2. Castello, A., Seo, S., Mayo, R., Balaji, P., Quintana-Orti, E.S., Pena, A.J.: GLTO: on the adequacy of lightweight thread approaches for OpenMP implementations. In: Proceedings of the International Conference on Parallel Processing, pp. 60–69 (2017)

  3. Gautier, T., Perez, C., Richard, J.: On the impact of OpenMP task granularity. In: de Supinski, B.R., Valero-Lara, P., Martorell, X., Mateo Bellido, S., Labarta, J. (eds.) IWOMP 2018. LNCS, vol. 11128, pp. 205–221. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-98521-3_14

  4. Giannozzi, P., et al.: QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials. J. Phys. Condens. Matter 21(39), 395502 (2009)

  5. Kalray MPPA products (2021). https://www.kalrayinc.com/

  6. Komatitsch, D., Tromp, J.: SPECFEM3D Cartesian (2021). https://github.com/geodynamics/specfem3d

  7. Kukanov, A., Voss, M.J.: The foundations for scalable multi-core software in Intel Threading Building Blocks. Intel Technol. J. 11(4), 309–322 (2007). http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=79B311F4CEB9A4B610520177C7144D57?doi=10.1.1.71.8289&rep=rep1&type=pdf

  8. Lagrone, J., Aribuki, A., Chapman, B.: A set of microbenchmarks for measuring OpenMP task overheads. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications II, pp. 594–600 (2011). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.217.9615&rep=rep1&type=pdf

  9. Leiserson, C.E.: The Cilk++ concurrency platform. J. Supercomput. 51(3), 244–257 (2010)

  10. Munera, A., Royuela, S., Quinones, E.: Towards a qualifiable OpenMP framework for embedded systems. In: Proceedings of the 2020 Design, Automation and Test in Europe Conference and Exhibition, DATE 2020, no. 2, pp. 903–908 (2020)

  11. Nvidia: CUDA Graph programming guide (2021). https://docs.nvidia.com/cuda/cuda-c-programming-guide/#cuda-graphs

  12. Olivier, S.L., Prins, J.F.: Evaluating OpenMP 3.0 run time systems on unbalanced task graphs. In: Müller, M.S., de Supinski, B.R., Chapman, B.M. (eds.) IWOMP 2009. LNCS, vol. 5568, pp. 63–78. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02303-3_6

  13. Perez, J.M., Beltran, V., Labarta, J., Ayguade, E.: Improving the integration of task nesting and dependencies in OpenMP. In: Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017, pp. 809–818 (2017)

  14. Sainz, F., Mateo, S., Beltran, V., Bosque, J.L., Martorell, X., Ayguadé, E.: Leveraging OmpSs to exploit hardware accelerators. In: 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing, pp. 112–119. IEEE (2014)

  15. Schuchart, J., Nachtmann, M., Gracia, J.: Patterns for OpenMP task data dependency overhead measurements. In: de Supinski, B.R., Olivier, S.L., Terboven, C., Chapman, B.M., Müller, M.S. (eds.) Scaling OpenMP for Exascale Performance and Portability, pp. 156–168. Springer International Publishing, Cham (2017)

  16. Serrano, M.A., Melani, A., Vargas, R., Marongiu, A., Bertogna, M., Quiñones, E.: Timing characterization of OpenMP4 tasking model. In: 2015 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES 2015, pp. 157–166 (2015)

  17. Stpiczyński, P.: Language-based vectorization and parallelization using intrinsics, OpenMP, TBB and Cilk Plus. J. Supercomput. 74(4), 1461–1472 (2018)

  18. TOP500 (2020). https://www.top500.org/lists/top500/2020/11/

  19. Valero-Lara, P., Catalán, S., Martorell, X., Usui, T., Labarta, J.: sLASs: a fully automatic auto-tuned linear algebra library based on OpenMP extensions implemented in OmpSs (LASs library). J. Parallel Distrib. Comput. 138, 153–171 (2020)

  20. Yu, C., Royuela, S., Quiñones, E.: OpenMP to CUDA graphs: a compiler-based transformation to enhance the programmability of NVIDIA devices. In: Proceedings of the 23rd International Workshop on Software and Compilers for Embedded Systems, SCOPES 2020, pp. 42–47 (2020)

Acknowledgements

This work has been supported by the EU H2020 project AMPERE under the grant agreement no. 871669.

Author information

Correspondence to Chenle Yu, Sara Royuela or Eduardo Quiñones.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Yu, C., Royuela, S., Quiñones, E. (2021). Enhancing OpenMP Tasking Model: Performance and Portability. In: McIntosh-Smith, S., de Supinski, B.R., Klinkenberg, J. (eds) OpenMP: Enabling Massive Node-Level Parallelism. IWOMP 2021. Lecture Notes in Computer Science, vol. 12870. Springer, Cham. https://doi.org/10.1007/978-3-030-85262-7_3

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-85262-7_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85261-0

  • Online ISBN: 978-3-030-85262-7

  • eBook Packages: Computer Science (R0)
