A novel warp scheduling scheme considering long-latency operations for high-performance GPUs

Published in: The Journal of Supercomputing

Abstract

Graphics processing units (GPUs) have become one of the best platforms for exploiting the plentiful thread-level parallelism of applications. However, GPUs still underutilize their hardware resources when executing many general-purpose applications, largely because existing warp schedulers are inefficient at hiding long-latency operations such as global loads and stores. This study proposes a long-latency operation-based warp scheduler to improve GPU performance. The proposed scheduler partitions warps into different pools based on the characteristics of the instructions they will execute next: warps that are likely to wait on long-latency operations take a guiding role, while the remaining warps take a filling role, overlapping the latencies incurred by the guiding warps. Our experimental results demonstrate that the proposed warp scheduler improves GPU performance by 24.4% on average compared to the conventional warp scheduler.
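The guiding/filling idea described in the abstract can be sketched in simplified form. The snippet below is a minimal illustration, not the authors' implementation: the opcode names, warp representation, and pool-selection policy are all assumptions made for the example. It partitions ready warps by whether their next instruction is a long-latency memory operation, and prefers to issue such "guiding" warps first so their latency starts early and can be overlapped by the "filling" warps.

```python
# Illustrative sketch of a two-pool, long-latency-aware warp scheduler.
# All names (opcodes, dict keys) are hypothetical, chosen for the example.

LONG_LATENCY_OPS = {"ld.global", "st.global"}  # assumed long-latency opcodes

def partition_warps(warps):
    """Split ready warps into a guiding pool (next instruction is a
    long-latency operation) and a filling pool (everything else)."""
    guiding = [w for w in warps if w["next_op"] in LONG_LATENCY_OPS]
    filling = [w for w in warps if w["next_op"] not in LONG_LATENCY_OPS]
    return guiding, filling

def pick_warp(warps):
    """Issue a guiding warp if one is ready, so its memory latency begins
    as early as possible; otherwise issue a filling warp to overlap the
    latency of previously issued guiding warps."""
    guiding, filling = partition_warps(warps)
    if guiding:
        return guiding[0]
    return filling[0] if filling else None

# Example: warp 1 is about to issue a global load, so it is scheduled
# ahead of warp 0, which is about to issue an arithmetic instruction.
warps = [{"id": 0, "next_op": "add"}, {"id": 1, "next_op": "ld.global"}]
print(pick_warp(warps)["id"])
```

The key design point the paper's scheme rests on is this division of labor: once the guiding warps' memory requests are in flight, the scheduler keeps the pipeline busy with filling warps instead of stalling.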



Acknowledgements

This work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018R1A2B6005740), and it was also supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2016-R2718-16-0011) supervised by the IITP (Institute for Information and Communications Technology Promotion).

Author information


Corresponding author

Correspondence to Cheol Hong Kim.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article


Cite this article

Do, C.T., Choi, H.J., Chung, S.W. et al. A novel warp scheduling scheme considering long-latency operations for high-performance GPUs. J Supercomput 76, 3043–3062 (2020). https://doi.org/10.1007/s11227-019-03091-2
