A novel warp scheduling scheme considering long-latency operations for high-performance GPUs

Published in: The Journal of Supercomputing

Abstract

Graphics processing units (GPUs) have become one of the best platforms for exploiting the plentiful thread-level parallelism of applications. However, GPUs still underutilize their hardware resources when executing many general-purpose applications, largely because existing warp schedulers are inefficient at hiding long-latency operations such as global loads and stores. This study proposes a long-latency operation-based warp scheduler to improve GPU performance. The proposed scheduler partitions warps into different pools based on the characteristics of the instructions they will execute next: warps that are likely to wait on long-latency operations take a guiding role, while the remaining warps take a filling role, overlapping the latencies incurred by the guiding warps. Our experimental results demonstrate that the proposed warp scheduler improves GPU performance by 24.4% on average compared to the conventional warp scheduler.
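The guiding/filling idea described in the abstract can be sketched in simplified form. The snippet below is a minimal illustration, not the authors' implementation: the opcode names, warp representation, and pool-selection policy are all assumptions made for the example. It partitions ready warps by whether their next instruction is a long-latency memory operation, and prefers to issue such "guiding" warps first so their latency starts early and can be overlapped by the "filling" warps.

```python
# Illustrative sketch of a two-pool, long-latency-aware warp scheduler.
# All names (opcodes, dict keys) are hypothetical, chosen for the example.

LONG_LATENCY_OPS = {"ld.global", "st.global"}  # assumed long-latency opcodes

def partition_warps(warps):
    """Split ready warps into a guiding pool (next instruction is a
    long-latency operation) and a filling pool (everything else)."""
    guiding = [w for w in warps if w["next_op"] in LONG_LATENCY_OPS]
    filling = [w for w in warps if w["next_op"] not in LONG_LATENCY_OPS]
    return guiding, filling

def pick_warp(warps):
    """Issue a guiding warp if one is ready, so its memory latency begins
    as early as possible; otherwise issue a filling warp to overlap the
    latency of previously issued guiding warps."""
    guiding, filling = partition_warps(warps)
    if guiding:
        return guiding[0]
    return filling[0] if filling else None

# Example: warp 1 is about to issue a global load, so it is scheduled
# ahead of warp 0, which is about to issue an arithmetic instruction.
warps = [{"id": 0, "next_op": "add"}, {"id": 1, "next_op": "ld.global"}]
print(pick_warp(warps)["id"])
```

The key design point the paper's scheme rests on is this division of labor: once the guiding warps' memory requests are in flight, the scheduler keeps the pipeline busy with filling warps instead of stalling.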



Acknowledgements

This work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018R1A2B6005740), and it was also supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2016-R2718-16-0011) supervised by the IITP (Institute for Information and Communications Technology Promotion).

Author information


Corresponding author

Correspondence to Cheol Hong Kim.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article


Cite this article

Do, C.T., Choi, H.J., Chung, S.W. et al. A novel warp scheduling scheme considering long-latency operations for high-performance GPUs. J Supercomput 76, 3043–3062 (2020). https://doi.org/10.1007/s11227-019-03091-2
