
Towards Balanced Learning for Instance Recognition

Published in: International Journal of Computer Vision

Abstract

Instance recognition has advanced rapidly alongside the development of deep convolutional neural networks. Compared with model architectures, the training process, which is also crucial to the success of detectors, has received relatively little attention. In this work, we carefully revisit the standard training practice of detectors and find that detection performance is often limited by imbalance during training, which generally manifests at three levels: sample level, feature level, and objective level. To mitigate the adverse effects caused thereby, we propose Libra R-CNN, a simple yet effective framework towards balanced learning for instance recognition. It integrates IoU-balanced sampling, a balanced feature pyramid, and objective re-weighting to reduce the imbalance at the sample, feature, and objective levels, respectively. Extensive experiments on the MS COCO, LVIS, and Pascal VOC datasets demonstrate the effectiveness of the overall balanced design.
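To make the sample-level idea concrete: IoU-balanced sampling replaces uniform random sampling of negative proposals with sampling spread evenly across IoU bins, so that harder negatives (higher overlap with ground truth) are not drowned out by the far more numerous easy ones. The sketch below is a plain-Python illustration of this idea, not the authors' implementation; the `(proposal_id, iou)` input format, the bin count, and the fallback top-up step are illustrative assumptions.

```python
import random
from collections import defaultdict

def iou_balanced_sample(negatives, num_expected, num_bins=3, max_iou=0.5):
    """Sample negative proposals evenly across IoU bins.

    `negatives` is assumed to be a list of (proposal_id, iou) pairs,
    where iou is each proposal's max overlap with any ground-truth box
    (below the positive threshold `max_iou`).
    """
    # Bucket negatives by IoU into equal-width bins over [0, max_iou).
    bins = defaultdict(list)
    bin_width = max_iou / num_bins
    for pid, iou in negatives:
        b = min(int(iou / bin_width), num_bins - 1)
        bins[b].append(pid)

    # Draw an equal share from every bin, so high-IoU (hard) negatives
    # are as likely to be picked as low-IoU (easy) ones.
    per_bin = num_expected // num_bins
    sampled = []
    for b in range(num_bins):
        pool = bins[b]
        sampled.extend(random.sample(pool, min(per_bin, len(pool))))

    # Top up with random leftovers if some bins were under-populated.
    if len(sampled) < num_expected:
        chosen = set(sampled)
        rest = [pid for pid, _ in negatives if pid not in chosen]
        sampled.extend(
            random.sample(rest, min(num_expected - len(sampled), len(rest)))
        )
    return sampled
```

Under uniform sampling, a bin holding 90% of the negatives would contribute roughly 90% of the minibatch; here each bin contributes an equal share regardless of its population, which is the balancing effect the method targets.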



Acknowledgements

This work is partially supported by the National Natural Science Foundation of China (No. 61975175), the Civilian Fundamental Research (No. D040301), the Collaborative Research grant from SenseTime Group (CUHK Agreement No. TS1610626 & No. TS1712093), and the General Research Fund (GRF) of Hong Kong (No. 14236516 & No. 14203518).

Author information


Corresponding author

Correspondence to Qi Li.

Additional information

Communicated by S.-C. Zhu.


Code is available at https://github.com/open-mmlab/mmdetection.


About this article


Cite this article

Pang, J., Chen, K., Li, Q. et al. Towards Balanced Learning for Instance Recognition. Int J Comput Vis 129, 1376–1393 (2021). https://doi.org/10.1007/s11263-021-01434-2

