Abstract
With the advent of many-core computer architectures such as GPGPUs from NVIDIA and AMD, and more recently Intel’s Xeon Phi, ensuring performance portability of HPC codes is potentially becoming more complex. In this work we focus on one important application area, structured grid codes, and investigate techniques for ensuring performance portability across a diverse range of high-end many-core architectures. We chose three codes to investigate: a 3D lattice Boltzmann code (D3Q19 BGK), the CloverLeaf hydrodynamics mini-application from Sandia’s Mantevo benchmark suite, and ROTORSIM, a production-quality structured grid, multiblock, compressible finite-volume CFD code. We developed OpenCL versions of these codes to provide cross-platform functional portability, and compared their performance with that of versions optimized for each platform, including hybrid OpenMP/MPI/AVX versions on CPUs and Xeon Phi, and CUDA versions on NVIDIA GPUs. Our results show that, contrary to conventional wisdom, OpenCL can achieve a high degree of performance portability, at least for structured grid applications, using a set of straightforward techniques. The performance-portable OpenCL code is also highly competitive with the best performance achieved using the native parallel programming models on each platform.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
McIntosh-Smith, S., Boulton, M., Curran, D., Price, J. (2014). On the Performance Portability of Structured Grid Codes on Many-Core Computer Architectures. In: Kunkel, J.M., Ludwig, T., Meuer, H.W. (eds) Supercomputing. ISC 2014. Lecture Notes in Computer Science, vol 8488. Springer, Cham. https://doi.org/10.1007/978-3-319-07518-1_4
DOI: https://doi.org/10.1007/978-3-319-07518-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07517-4
Online ISBN: 978-3-319-07518-1
eBook Packages: Computer Science (R0)