cuConv: CUDA implementation of convolution for CNN inference

Abstract

Convolutions are the core operation of deep learning applications based on Convolutional Neural Networks (CNNs). Current GPU architectures are highly efficient for training and deploying deep CNNs, and are widely used in production. State-of-the-art implementations, however, exhibit low efficiency for some commonly used network configurations. In this paper we propose a GPU-based implementation of the convolution operation for CNN inference that favors coalesced accesses, without requiring prior data transformations. Our experiments demonstrate that it yields notable performance improvements in a range of common CNN forward-propagation convolution configurations, with speedups of up to 2.29× with respect to the best implementation in cuDNN, covering a relevant region of the configuration space with respect to currently existing approaches. This improvement results in speedups of up to 7.4% for CNN online inference use cases.
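
The paper's kernel is not reproduced on this page; the following minimal CUDA sketch only illustrates the kind of direct, transformation-free convolution with coalesced global-memory accesses that the abstract refers to. The kernel name, data layout, thread mapping, and launch configuration are illustrative assumptions, not the authors' implementation.

#include <cuda_runtime.h>

// Minimal sketch of a direct 2D convolution kernel (single image); NOT the
// paper's cuConv implementation. Assumes a CHW input layout, unit stride and
// no padding; all names and the thread mapping are illustrative assumptions.
__global__ void direct_conv2d(const float* __restrict__ in,   // input   [C][H][W]
                              const float* __restrict__ filt, // filters [K][C][R][S]
                              float* __restrict__ out,        // output  [K][H-R+1][W-S+1]
                              int C, int H, int W,
                              int K, int R, int S)
{
    const int outH = H - R + 1;
    const int outW = W - S + 1;

    // Consecutive threads of a warp handle consecutive output columns, so the
    // inner loads of `in` touch consecutive addresses and can be coalesced.
    const int x = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    const int y = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    const int k = blockIdx.z;                              // output feature map
    if (x >= outW || y >= outH || k >= K) return;

    float acc = 0.0f;
    for (int c = 0; c < C; ++c)
        for (int r = 0; r < R; ++r)
            for (int s = 0; s < S; ++s)
                acc += in[(c * H + (y + r)) * W + (x + s)]
                     * filt[((k * C + c) * R + r) * S + s];

    out[(k * outH + y) * outW + x] = acc;
}

// Hypothetical launch for a K x outH x outW output:
//   dim3 block(32, 4);
//   dim3 grid((outW + 31) / 32, (outH + 3) / 4, K);
//   direct_conv2d<<<grid, block>>>(d_in, d_filt, d_out, C, H, W, K, R, S);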

Author information

Corresponding author

Correspondence to Marc Jordà.

Ethics declarations

Data availability

The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Jordà, M., Valero-Lara, P. & Peña, A.J. cuConv: CUDA implementation of convolution for CNN inference. Cluster Comput 25, 1459–1473 (2022). https://doi.org/10.1007/s10586-021-03494-y
