Parametric GPU Code Generation for Affine Loop Programs

Konstantinidis, Athanasios; Kelly, Paul H. J.; Ramanujam, J.; Sadayappan, P.

doi:10.1007/978-3-319-09967-5_8

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8664)

Included in the following conference series:

International Workshop on Languages and Compilers for Parallel Computing

Abstract

Partitioning a parallel computation into finite-sized chunks for effective mapping onto a parallel machine is a critical concern for source-to-source compilation. In the context of OpenCL and CUDA, this translates to defining a uniform hyper-rectangular partitioning of the parallel execution space, where each partition is subject to a fine-grained distribution of resources that has a direct yet hard-to-estimate impact on performance. This paper develops the first compilation scheme for generating parametrically tiled codes for affine loop programs on GPUs, which facilitates run-time exploration of partitioning parameters as a fast and portable way of finding the ones that yield maximum performance. Our approach is based on a parametric tiling scheme for producing wavefronts of parallel rectangular partitions of parametric size, and on a novel runtime system that manages wavefront execution and local memory usage dynamically through an inspector-executor mechanism. An experimental evaluation demonstrates the effectiveness of our approach for wavefront as well as rectangularly-parallel partitionings.
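To make the idea of "wavefronts of parallel rectangular partitions of parametric size" concrete, the following is a minimal Python sketch, not the paper's compilation scheme or runtime system. It assumes a 2-D iteration space whose only dependences are (1, 0) and (0, 1), so that all tiles on the same anti-diagonal can run in parallel. The names `wavefront_tiles` and `run_tiled` are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def wavefront_tiles(n, m, ti, tj):
    """Group the rectangular tiles of an n x m iteration space into wavefronts.

    ti and tj are ordinary run-time arguments, so tile sizes can be changed
    without regenerating the code (assumption: dependences (1,0) and (0,1)).
    Each tile is (i0, i1, j0, j1) with half-open ranges [i0, i1) x [j0, j1).
    """
    n_ti = -(-n // ti)  # ceil(n / ti): number of tiles along i
    n_tj = -(-m // tj)  # ceil(m / tj): number of tiles along j
    fronts = defaultdict(list)
    for a in range(n_ti):
        for b in range(n_tj):
            # Clamp partial tiles at the boundary of the iteration space.
            tile = (a * ti, min((a + 1) * ti, n), b * tj, min((b + 1) * tj, m))
            # Tile (a, b) only depends on (a-1, b) and (a, b-1), which lie
            # on wavefront a + b - 1.
            fronts[a + b].append(tile)
    return [fronts[w] for w in range(n_ti + n_tj - 1)]


def run_tiled(n, m, ti, tj):
    """Evaluate grid[i][j] = grid[i-1][j] + grid[i][j-1] in wavefront order."""
    grid = [[1] * (m + 1) for _ in range(n + 1)]  # row 0 and column 0 are boundary
    for front in wavefront_tiles(n, m, ti, tj):
        # Tiles within one front are mutually independent; on a GPU they
        # could be mapped to separate work-groups. Here they run sequentially.
        for i0, i1, j0, j1 in front:
            for i in range(i0 + 1, i1 + 1):
                for j in range(j0 + 1, j1 + 1):
                    grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
    return grid
```

Because the tile sizes are plain arguments, a caller can try several values at run time, for example `run_tiled(6, 7, 2, 3)` and `run_tiled(6, 7, 4, 4)`, and compare timings while the results stay identical.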

Notes

1. To keep the definitions presented and used in the rest of the paper general, OpenCL terminology is primarily adopted.
2. A floor operator returns the largest integer that is not greater than the exact value of the fraction.
3. Only CUDA targets are currently supported.
4. http://www.doc.ic.ac.uk/~phjk/LCPC13
5. We used tile sizes ranging from 8 to 32 with a stride of 4; on Jacobi-1d, we searched time tile sizes up to 256.
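As a small illustration of notes 2 and 5, the sketch below shows the floor operator applied to a tile-index computation and enumerates the tile sizes named in note 5. It is an assumption-laden example, not code from the paper: the function names `tile_index` and `candidate_tile_sizes` are hypothetical, and the paper's actual search procedure is not shown here.

```python
def tile_index(i, tile_size):
    """Index of the tile containing iteration i, using the floor operator.

    Python's // rounds toward negative infinity, which matches the floor
    definition in note 2. Truncating division (as in C for integers) would
    instead round negative quotients toward zero.
    """
    return i // tile_size


def candidate_tile_sizes(lo=8, hi=32, stride=4):
    """Tile sizes from lo to hi inclusive with the given stride (note 5)."""
    return list(range(lo, hi + 1, stride))
```

The difference between floor and truncation matters only for negative coordinates, which can appear after skewing an iteration space: `tile_index(-1, 8)` is `-1`, whereas truncation would place iteration `-1` in tile `0`.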

Acknowledgments

This work was supported in part by the U.S. National Science Foundation through awards 0811457, 0904549, 1059417 and 1205682. The authors would also like to thank Codeplay Software and EPSRC for their support as well as Louis-Noël Pouchet and Sanket Tavarageri for their valuable contributions.

Author information

Authors and Affiliations

Imperial College London, London, UK
Athanasios Konstantinidis & Paul H. J. Kelly
Louisiana State University, Baton Rouge, USA
J. Ramanujam
The Ohio State University, Columbus, USA
P. Sadayappan

Corresponding author

Correspondence to Athanasios Konstantinidis.

Editor information

Editors and Affiliations

Silicon Valley, Qualcomm Research, San Jose, California, USA
Călin Cașcaval
Silicon Valley, Qualcomm Research, San Jose, California, USA
Pablo Montesinos

About this paper

Cite this paper

Konstantinidis, A., Kelly, P.H.J., Ramanujam, J., Sadayappan, P. (2014). Parametric GPU Code Generation for Affine Loop Programs. In: Cașcaval, C., Montesinos, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2013. Lecture Notes in Computer Science, vol 8664. Springer, Cham. https://doi.org/10.1007/978-3-319-09967-5_8

DOI: https://doi.org/10.1007/978-3-319-09967-5_8
Published: 01 October 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09966-8
Online ISBN: 978-3-319-09967-5
eBook Packages: Computer Science, Computer Science (R0)
