Parametric GPU Code Generation for Affine Loop Programs

  • Conference paper
Languages and Compilers for Parallel Computing (LCPC 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8664)

Abstract

Partitioning a parallel computation into finite-sized chunks for effective mapping onto a parallel machine is a critical concern for source-to-source compilation. In the context of OpenCL and CUDA, this translates to the definition of a uniform hyper-rectangular partitioning of the parallel execution space, where each partition is subject to a fine-grained distribution of resources that has a direct yet hard-to-estimate impact on performance. This paper develops the first compilation scheme for generating parametrically tiled code for affine loop programs on GPUs, which facilitates run-time exploration of partitioning parameters as a fast and portable way of finding the ones that yield maximum performance. Our approach is based on a parametric tiling scheme for producing wavefronts of parallel rectangular partitions of parametric size and a novel runtime system that manages wavefront execution and local memory usage dynamically through an inspector-executor mechanism. An experimental evaluation demonstrates the effectiveness of our approach for wavefront as well as rectangularly-parallel partitionings.
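
To make the idea of wavefronts of rectangular partitions of parametric size more concrete, the following is a minimal CPU-side sketch in Python. It is not the code generated by the scheme described in the abstract, and it does not model the inspector-executor runtime or local memory management. The function names (`wavefronts`, `run_tiled`, `run_reference`) and the two-dimensional recurrence are hypothetical and chosen only for illustration. The sketch assumes dependences of distance (1, 0) and (0, 1), under which all tiles with the same value of tile-row plus tile-column are mutually independent.

```python
def wavefronts(n, m, ti, tj):
    """Yield one list of tile origins per wavefront of an n x m space.

    ti and tj are the tile sizes; they are ordinary run-time arguments.
    Tiles in the same list do not depend on each other, assuming
    dependences of distance (1, 0) and (0, 1) only.
    """
    nti = -(-n // ti)  # number of tiles along i, i.e. ceil(n / ti)
    ntj = -(-m // tj)  # number of tiles along j, i.e. ceil(m / tj)
    for w in range(nti + ntj - 1):
        lo = max(0, w - ntj + 1)
        hi = min(w, nti - 1)
        yield [(a * ti, (w - a) * tj) for a in range(lo, hi + 1)]


def run_tiled(n, m, ti, tj):
    """Evaluate a[i][j] = a[i-1][j] + a[i][j-1] tile by tile."""
    a = [[1] * m for _ in range(n)]
    for front in wavefronts(n, m, ti, tj):
        # On a GPU, each tile of a wavefront could be mapped to one
        # work-group (thread block); here the tiles run one after another.
        for i0, j0 in front:
            for i in range(max(i0, 1), min(i0 + ti, n)):
                for j in range(max(j0, 1), min(j0 + tj, m)):
                    a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


def run_reference(n, m):
    """Same recurrence in the original, untiled loop order."""
    a = [[1] * m for _ in range(n)]
    for i in range(1, n):
        for j in range(1, m):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


if __name__ == "__main__":
    # Try several tile sizes without changing or regenerating the code.
    for t in range(8, 33, 4):
        assert run_tiled(20, 37, t, t) == run_reference(20, 37)
```

Because the tile sizes are plain arguments rather than compile-time constants, the same code can be rerun with different sizes, which is the property that makes run-time exploration of partitioning parameters possible.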

Notes

  1. In order to facilitate the generality of the definitions presented and used in the rest of the paper, the OpenCL terminology will be primarily adopted.

  2. A floor operator returns the largest integer that is not greater than the actual result of the fraction.

  3. Currently supporting CUDA targets.

  4. http://www.doc.ic.ac.uk/~phjk/LCPC13

  5. We used tile sizes ranging from 8 to 32 with a stride of 4 while on Jacobi-1d we searched up to time tile sizes of 256.

Acknowledgments

This work was supported in part by the U.S. National Science Foundation through awards 0811457, 0904549, 1059417 and 1205682. The authors would also like to thank Codeplay Software and EPSRC for their support as well as Louis-Noël Pouchet and Sanket Tavarageri for their valuable contributions.

Author information

Corresponding author

Correspondence to Athanasios Konstantinidis.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Konstantinidis, A., Kelly, P.H.J., Ramanujam, J., Sadayappan, P. (2014). Parametric GPU Code Generation for Affine Loop Programs. In: Cașcaval, C., Montesinos, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2013. Lecture Notes in Computer Science, vol 8664. Springer, Cham. https://doi.org/10.1007/978-3-319-09967-5_8

  • DOI: https://doi.org/10.1007/978-3-319-09967-5_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09966-8

  • Online ISBN: 978-3-319-09967-5

  • eBook Packages: Computer Science, Computer Science (R0)
