
Approaches for Task Affinity in OpenMP

  • Conference paper
  • In: OpenMP: Memory, Devices, and Tasks (IWOMP 2016)
  • Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 9903)

Abstract

OpenMP tasking supports parallelization of irregular algorithms. Recent OpenMP specifications have extended tasking to increase functionality and to support optimizations, for instance with the taskloop construct. However, task scheduling remains opaque, which leads to inconsistent performance on NUMA architectures. We assess the design issues for task affinity and explore several approaches to enable it. We evaluate these proposals with implementations in the Nanos++ and LLVM OpenMP runtimes, showing performance improvements of up to 40% and significantly reduced execution time variation.
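
To make the setting concrete, the sketch below shows block-wise OpenMP tasks operating on an array whose pages have been distributed across NUMA nodes by first-touch initialization. This is an illustrative example rather than code from the paper: the array, block size, and work function are assumed placeholders, and the affinity clause shown is the hint later standardized in OpenMP 5.0, in the spirit of the proposals evaluated here. Without such a hint, the placement of each task relative to its data is left entirely to the runtime scheduler.

```c
/* Illustrative sketch (not from the paper): tasks over NUMA-distributed data.
 * The affinity clause is the OpenMP 5.0 hint; array name, sizes, and the
 * work function are placeholder assumptions. A compiler supporting OpenMP 5.0
 * is needed for the affinity clause (runtimes may still ignore the hint). */
#include <stdlib.h>

#define N     (1L << 24)   /* total elements */
#define BLOCK (1L << 18)   /* elements per task */

static void process_block(double *a, long begin, long end)
{
    for (long i = begin; i < end; ++i)
        a[i] = 2.0 * a[i] + 1.0;   /* placeholder computation */
}

int main(void)
{
    double *a = malloc(N * sizeof(double));

    /* First-touch initialization: each thread touches the pages it
     * initializes, distributing them across the NUMA nodes. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; ++i)
        a[i] = (double)i;

    #pragma omp parallel
    #pragma omp single
    for (long b = 0; b < N; b += BLOCK) {
        long end = (b + BLOCK < N) ? b + BLOCK : N;
        /* Hint that this task should execute close to the data it touches;
         * a runtime without task-affinity support may schedule it anywhere. */
        #pragma omp task affinity(a[b : end - b]) firstprivate(b, end)
        process_block(a, b, end);
    }

    free(a);
    return 0;
}
```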

The rights of this work are transferred to the extent transferable according to title 17 U.S.C. 105.


Notes

  1. Future versions of OpenMP may support explicit memory affinity and thereby enhance the definition of a location.

  2. http://openmp.llvm.org/.

  3. Further information about the STREAM benchmark suite is available at: http://www.cs.virginia.edu/stream/ref.html.


Acknowledgement

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

This work has been developed with the support of the Severo Ochoa Program grant SEV-2011-00067, awarded by the Spanish Government; by the Spanish Ministry of Science and Innovation (TIN2015-65316-P, Computación de Altas Prestaciones VII); and by the Intel-BSC Exascale Lab collaboration project.

Some of the experiments were performed with computing resources granted by JARA-HPC from RWTH Aachen University under project jara0001. Parts of this work were funded by the German Federal Ministry of Education and Research (BMBF) under grant number 01IH13008A (ELP).

Intel and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

* Other names and brands are the property of their respective owners.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Author information


Correspondence to Christian Terboven.


Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Terboven, C. et al. (2016). Approaches for Task Affinity in OpenMP. In: Maruyama, N., de Supinski, B., Wahib, M. (eds) OpenMP: Memory, Devices, and Tasks. IWOMP 2016. Lecture Notes in Computer Science, vol 9903. Springer, Cham. https://doi.org/10.1007/978-3-319-45550-1_8


  • DOI: https://doi.org/10.1007/978-3-319-45550-1_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45549-5

  • Online ISBN: 978-3-319-45550-1

