Align-Yolact: a one-stage semantic segmentation network for real-time object detection

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Object detection is a classic problem in computer vision, and its main bottleneck lies in the fusion of multi-scale features. In this paper, we systematically study neural network architecture design choices for real-time object detection and propose Align-Yolact to improve instance segmentation accuracy. First, we propose a weighted bounding box, which improves bounding-box localization accuracy. Second, we add a bi-directional feature pyramid network to the feature fusion stage, which improves mask quality and accuracy on small targets. Owing to these optimizations and better backbones, we achieve state-of-the-art (SOTA) results in both detection efficiency and accuracy.
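
The abstract does not spell out how the bi-directional feature pyramid network combines features, so the following is only a minimal sketch of the weighted fusion step as it is commonly described for BiFPN-style networks (fast normalized fusion), not the authors' implementation. The function name `fast_normalized_fusion`, the flattened-list representation of feature maps, and the example values are all hypothetical, chosen for illustration.

```python
from typing import List, Sequence


def fast_normalized_fusion(
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
    eps: float = 1e-4,
) -> List[float]:
    """Fuse same-shaped feature maps using non-negative, normalized weights.

    Hypothetical helper: feature maps are flattened lists of floats and are
    assumed to have been resized to a common resolution beforehand.
    """
    if not features:
        raise ValueError("need at least one feature map")
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature map")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("feature maps must share one shape (resize them first)")

    # A ReLU keeps each (normally learnable) weight non-negative.
    clipped = [max(w, 0.0) for w in weights]
    # Dividing by the weight sum, plus a small eps for numerical stability,
    # keeps every normalized weight between 0 and 1.
    total = sum(clipped) + eps
    return [
        sum(w * f[i] for w, f in zip(clipped, features)) / total
        for i in range(length)
    ]


# Two flattened 2x2 feature maps (illustrative values only).
top_down = [1.0, 2.0, 3.0, 4.0]
lateral = [3.0, 2.0, 1.0, 0.0]
fused = fast_normalized_fusion([top_down, lateral], weights=[1.0, 3.0])
print(fused)
```

In a trained network the weights would be learned per fusion node rather than fixed by hand. The normalization lets the network emphasize whichever input resolution is more informative without a softmax over the weights.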

Acknowledgements

The authors would like to thank all the participants who took part in the experiments. This work was supported in part by the National Science Foundation of China (Grant No. 61841701), the Fujian Vocational College Intelligent Equipment Application Technology Collaborative Innovation Center Construction Project (Grant No. 2016-7), and the Science and Technology Project of the Transportation Department of Fujian Province (Grant No. 201934).

Author information

Corresponding author

Correspondence to Shaodan Lin.

About this article

Cite this article

Lin, S., Zhu, K., Feng, C. et al. Align-Yolact: a one-stage semantic segmentation network for real-time object detection. J Ambient Intell Human Comput 14, 863–870 (2023). https://doi.org/10.1007/s12652-021-03340-4

  • DOI: https://doi.org/10.1007/s12652-021-03340-4
