cuConv: CUDA implementation of convolution for CNN inference

Abstract

Convolutions are the core operation of deep learning applications based on Convolutional Neural Networks (CNNs). Current GPU architectures are highly efficient for training and deploying deep CNNs, and are widely used in production. State-of-the-art implementations, however, exhibit low efficiency for some commonly used network configurations. In this paper we propose a GPU-based implementation of the convolution operation for CNN inference that favors coalesced accesses, without requiring prior data transformations. Our experiments demonstrate that it yields notable performance improvements in a range of common CNN forward-propagation convolution configurations, with speedups of up to 2.29× with respect to the best implementation in cuDNN, covering a relevant region of the configuration space with respect to currently existing approaches. This improvement results in speedups of up to 7.4% for CNN online inference use cases.
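
The paper's kernel is not reproduced on this page; the following minimal CUDA sketch only illustrates the kind of direct, transformation-free convolution with coalesced global-memory accesses that the abstract refers to. The kernel name, data layout, thread mapping, and launch configuration are illustrative assumptions, not the authors' implementation.

#include <cuda_runtime.h>

// Minimal sketch of a direct 2D convolution kernel (single image); NOT the
// paper's cuConv implementation. Assumes a CHW input layout, unit stride and
// no padding; all names and the thread mapping are illustrative assumptions.
__global__ void direct_conv2d(const float* __restrict__ in,   // input   [C][H][W]
                              const float* __restrict__ filt, // filters [K][C][R][S]
                              float* __restrict__ out,        // output  [K][H-R+1][W-S+1]
                              int C, int H, int W,
                              int K, int R, int S)
{
    const int outH = H - R + 1;
    const int outW = W - S + 1;

    // Consecutive threads of a warp handle consecutive output columns, so the
    // inner loads of `in` touch consecutive addresses and can be coalesced.
    const int x = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    const int y = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    const int k = blockIdx.z;                              // output feature map
    if (x >= outW || y >= outH || k >= K) return;

    float acc = 0.0f;
    for (int c = 0; c < C; ++c)
        for (int r = 0; r < R; ++r)
            for (int s = 0; s < S; ++s)
                acc += in[(c * H + (y + r)) * W + (x + s)]
                     * filt[((k * C + c) * R + r) * S + s];

    out[(k * outH + y) * outW + x] = acc;
}

// Hypothetical launch for a K x outH x outW output:
//   dim3 block(32, 4);
//   dim3 grid((outW + 31) / 32, (outH + 3) / 4, K);
//   direct_conv2d<<<grid, block>>>(d_in, d_filt, d_out, C, H, W, K, R, S);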

Author information

Corresponding author

Correspondence to Marc Jordà.

Ethics declarations

Data availability

The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jordà, M., Valero-Lara, P. & Peña, A.J. cuConv: CUDA implementation of convolution for CNN inference. Cluster Comput 25, 1459–1473 (2022). https://doi.org/10.1007/s10586-021-03494-y
