
A Transformation-Based Approach to Developing High-Performance GPU Programs

  • Conference paper
Perspectives of System Informatics (PSI 2017)

Abstract

We advocate the use of formal patterns and transformations for programming modern many-core processors such as Graphics Processing Units (GPUs), as an alternative to the currently used low-level, ad hoc programming approaches like CUDA or OpenCL. Our new contribution is the introduction of an intermediate level of low-level patterns in order to bridge the abstraction gap between popular high-level patterns (map, fold/reduce, zip, etc.) and imperative, executable code for many-cores. We define our low-level patterns based on the OpenCL programming model, which is portable across parallel architectures of different vendors, and we introduce semantics-preserving rewrite rules that transform programs with high-level patterns into programs with low-level patterns, from which executable OpenCL programs are automatically generated. We show that program design decisions and optimizations, which are usually applied ad hoc by experts, are expressed systematically in our approach as provably correct transformations of high- and low-level patterns. We evaluate our approach by systematically deriving several differently optimized OpenCL implementations of parallel reduction that achieve performance competitive with OpenCL programs manually written and highly tuned by performance experts.
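
As a simple illustration of the high-level patterns named above (our example, not taken from the paper): a dot product of two vectors xs and ys can be expressed entirely as a composition of zip, map and reduce, and it is expressions of this form that the rewrite rules subsequently lower towards executable OpenCL code:

\[ \mathit{dotProduct}(xs, ys) \;=\; \mathit{reduce}\,(+, 0)\,\big(\mathit{map}\,(\times)\,(\mathit{zip}\,(xs, ys))\big) \]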


References

  1. Aldinucci, M., Danelutto, M., Kilpatrick, P., Torquati, M.: FastFlow: high-level and efficient streaming on multi-core. In: Programming Multi-core and Many-core Computing Systems. Wiley-Blackwell, Hoboken (2011)

  2. AMD: Bolt C++ Template Library

  3. Backus, J.: Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. Commun. ACM 21(8), 613–641 (1978)

  4. Bird, R.S.: Algebraic identities for program calculation. Comput. J. 32(2), 122–126 (1989)

  5. Burstall, R.M., Darlington, J.: A transformation system for developing recursive programs. J. ACM 24(1), 44–67 (1977)

  6. Chakravarty, M., Keller, G., Lee, S., McDonell, T.L., Grover, V.: Accelerating Haskell array codes with multicore GPUs. In: DAMP, pp. 3–14. ACM (2011)

  7. Gorlatch, S., Cole, M.: Parallel skeletons. In: Padua, D. (ed.) Encyclopedia of Parallel Computing, pp. 1417–1422. Springer, Boston (2011). https://doi.org/10.1007/978-0-387-09766-4

  8. Harris, M., et al.: Optimizing parallel reduction in CUDA. NVIDIA Developer Technol. 2(4), 1–39 (2007)

  9. Holk, E., Byrd, W.E., Mahajan, N., Willcock, J., Chauhan, A., Lumsdaine, A.: Declarative parallel programming for GPUs. In: PARCO, pp. 297–304 (2011)

  10. Khronos OpenCL Working Group: The OpenCL Specification

  11. Kuchen, H.: A skeleton library. In: Monien, B., Feldmann, R. (eds.) Euro-Par 2002. LNCS, vol. 2400, pp. 620–629. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45706-2_86

  12. Nvidia: CUDA Basic Linear Algebra Subroutines (cuBLAS). Version 6.5

  13. Steuwer, M., Fensch, C., Lindley, S., Dubach, C.: Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code. In: ICFP, pp. 205–217. ACM (2015)

  14. Steuwer, M., Gorlatch, S.: High-level programming for medical imaging on multi-GPU systems using the SkelCL library. In: Procedia Computer Science, ICCS, vol. 18, pp. 749–758. Elsevier (2013)

  15. Steuwer, M., Kegel, P., Gorlatch, S.: SkelCL: a portable skeleton library for high-level GPU programming. In: HIPS @ IPDPS, pp. 1176–1182. IEEE (2011)

  16. Steuwer, M., Remmelg, T., Dubach, C.: Lift: a functional data-parallel IR for high-performance GPU code generation. In: CGO, pp. 74–85. ACM (2017)

  17. Svensson, J., Sheeran, M., Claessen, K.: Obsidian: a domain specific embedded language for parallel programming of graphics processors. In: Scholz, S.-B., Chitil, O. (eds.) IFL 2008. LNCS, vol. 5836, pp. 156–173. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24452-0_9


Acknowledgments

This work was supported by the German Research Council (DFG) within the Cluster of Excellence CiM (University of Münster), by the German Ministry of Education and Research (BMBF) within the project HPC²SE, and by a EuroLab-4-HPC collaboration. We thank Nvidia for their generous hardware donation used in our experiments.

Author information


Correspondence to Bastian Hagedorn.


Appendices


A Additional Rewrite Rules

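For orientation, two semantics-preserving rules of the kind used in this approach, written in our own notation (illustrative examples; not necessarily the rules listed in the paper's figure): map fusion and the split/join identity,

\[ \mathit{map}\,f \circ \mathit{map}\,g \;=\; \mathit{map}\,(f \circ g), \qquad \mathit{join} \circ \mathit{split}\,m \;=\; \mathit{id}. \]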

B Proof of a Rewrite Rule

Rewrite rules are proved using equational reasoning. As an example, we prove rule (25), which introduces layers in the computation hierarchy of a reduction: first a partial reduction is computed, followed by a reduction that combines all temporary results.

Proof

(Reduce-Promotion Variant). Let n be a number divisible by m.
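
A sketch of the underlying reasoning, in our own notation (assumptions for illustration: an associative operator \(\oplus\) with identity e, an input of length n = k·m, and split/map/reduce as above; the paper's exact statement of rule (25) may differ in detail):

\[ \mathit{reduce}\,(\oplus, e) \;=\; \mathit{reduce}\,(\oplus, e) \circ \mathit{map}\,\big(\mathit{reduce}\,(\oplus, e)\big) \circ \mathit{split}\,m \]

Applied to \(x_1, \dots, x_n\), the right-hand side splits the input into k chunks of length m, reduces each chunk to a partial result, and then reduces the k partial results. By associativity of \(\oplus\),

\[ x_1 \oplus \dots \oplus x_n \;=\; (x_1 \oplus \dots \oplus x_m) \oplus \dots \oplus (x_{n-m+1} \oplus \dots \oplus x_n), \]

so both sides compute the same value, which establishes the rule.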

C Derived Low-Level Reduction Programs

Fig. 8. Two more low-level programs implementing parallel reduction. They are equivalent to the fourth and the seventh most optimized versions described in [8], respectively.
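
For orientation, the following is a generic baseline OpenCL reduction kernel of the kind such derivations target (a hand-written sketch, not one of the paper's derived or generated programs; the kernel and argument names are ours):

    __kernel void reduce_sum(__global const float* in,
                             __global float* out,
                             __local float* scratch,
                             const unsigned int n)
    {
        const unsigned int lid = get_local_id(0);
        const unsigned int gid = get_global_id(0);

        /* Each work-item loads one element into local memory (0 if out of range). */
        scratch[lid] = (gid < n) ? in[gid] : 0.0f;
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Tree-shaped reduction within the work-group. */
        for (unsigned int s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s) {
                scratch[lid] += scratch[lid + s];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        /* Work-item 0 writes this work-group's partial sum. */
        if (lid == 0) {
            out[get_group_id(0)] = scratch[0];
        }
    }

Each work-group writes one partial sum to out; a second kernel launch (or a host-side loop) then reduces the per-group partial results, mirroring the two-layer structure introduced by the rule proved in Appendix B.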


Copyright information

© 2018 Springer International Publishing AG

About this paper


Cite this paper

Hagedorn, B., Steuwer, M., Gorlatch, S. (2018). A Transformation-Based Approach to Developing High-Performance GPU Programs. In: Petrenko, A., Voronkov, A. (eds) Perspectives of System Informatics. PSI 2017. Lecture Notes in Computer Science(), vol 10742. Springer, Cham. https://doi.org/10.1007/978-3-319-74313-4_14


  • DOI: https://doi.org/10.1007/978-3-319-74313-4_14


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-74312-7

  • Online ISBN: 978-3-319-74313-4
