Abstract
In line with the success of deep learning on traditional recognition problems, several end-to-end deep models for zero-shot recognition have been proposed in the literature. These models successfully predict a single unseen label for an input image, but do not scale to cases where multiple unseen objects are present. In this paper, we model this problem within the framework of Multiple Instance Learning (MIL). To the best of our knowledge, we propose the first end-to-end trainable deep MIL framework for the multi-label zero-shot tagging problem. Due to its novel design, the proposed framework has several interesting features: (1) Unlike previous deep MIL models, it does not use any off-line procedure (e.g., Selective Search or EdgeBoxes) for bag generation. (2) At test time, it can process any number of unseen labels given their semantic embedding vectors. (3) Using only seen labels per image as weak annotation, it can produce a bounding box for each predicted label. We experiment on the large-scale NUS-WIDE dataset and achieve superior performance across conventional, zero-shot and generalized zero-shot tagging tasks.
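The core mechanism the abstract describes, scoring a bag of image regions against arbitrary label embeddings and max-pooling over the bag, can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the dimensions, the random placeholder features, and the single linear projection `W` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: visual features per instance and word-vector size.
n_instances, d_vis, d_sem = 5, 2048, 300

# Bag of instance (region) features for one image; in a real model these
# would come from a CNN backbone, here they are random placeholders.
bag = rng.standard_normal((n_instances, d_vis))

# Learned projection from visual space into the semantic (word-vector) space.
W = rng.standard_normal((d_vis, d_sem)) * 0.01

# Word embeddings for an arbitrary label set, seen or unseen; at test time a
# new label only requires its embedding vector, which is why the label set
# can change without retraining.
n_labels = 7
label_embeddings = rng.standard_normal((n_labels, d_sem))

# Score every instance against every label, then max-pool over the bag:
# an image receives a tag if at least one instance supports that label.
instance_scores = bag @ W @ label_embeddings.T   # shape (n_instances, n_labels)
image_scores = instance_scores.max(axis=0)       # shape (n_labels,)

# The argmax over instances indicates which region supports each predicted
# label, which is what enables weakly supervised localization.
best_instance = instance_scores.argmax(axis=0)

print(image_scores.shape)
```

In an end-to-end model, `bag` and `W` would be produced and trained jointly; the max operator keeps the whole pipeline differentiable with respect to the supporting instance.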
References
Akata, Z., Reed, S., Walter, D., Lee, H., Schiele, B.: Evaluation of output embeddings for fine-grained image classification. In: CVPR, pp. 2927–2936 (2015)
Akata, Z., Malinowski, M., Fritz, M., Schiele, B.: Multi-cue zero-shot learning with strong supervision. In: CVPR, June 2016
Bourbaki, N.: Eléments de mathématiques: théorie des ensembles, chapitres 1 à 4, vol. 1. Masson (1990)
Chen, M., Zheng, A., Weinberger, K.Q.: Fast image tagging. In: ICML, January 2013
Cheng, M.M., Zhang, Z., Lin, W.Y., Torr, P.: Bing: binarized normed gradients for objectness estimation at 300fps. In: CVPR, pp. 3286–3293 (2014)
Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.T.: NUS-WIDE: a real-world web image database from National University of Singapore. In: CIVR (2009)
Demirel, B., Gokberk Cinbis, R., Ikizler-Cinbis, N.: Attributes2classname: a discriminative model for attribute-based unsupervised zero-shot learning. In: ICCV, October 2017
Deutsch, S., Kolouri, S., Kim, K., Owechko, Y., Soatto, S.: Zero shot learning via multi-scale manifold regularization. In: CVPR, July 2017
Feng, J., Zhou, Z.H.: Deep MIML network. In: AAAI, pp. 1884–1890 (2017)
Fu, Y., Yang, Y., Hospedales, T., Xiang, T., Gong, S.: Transductive multi-label zero-shot learning. arXiv preprint arXiv:1503.07790 (2015)
Girshick, R.: Fast R-CNN. In: ICCV, December 2015
Gong, Y., Jia, Y., Leung, T., Toshev, A., Ioffe, S.: Deep convolutional ranking for multilabel image annotation. arXiv preprint arXiv:1312.4894 (2013)
Hassoun, M.H.: Fundamentals of Artificial Neural Networks. MIT Press, Cambridge (1995)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
Dai, J., Li, Y., He, K., Sun, J.: R-FCN: object detection via region-based fully convolutional networks. arXiv preprint arXiv:1605.06409 (2016)
Li, X., Liao, S., Lan, W., Du, X., Yang, G.: Zero-shot image tagging by hierarchical semantic embedding. In: SIGIR, pp. 879–882. ACM (2015)
Li, Y., Wang, D., Hu, H., Lin, Y., Zhuang, Y.: Zero-shot recognition using dual visual-semantic mapping paths. In: CVPR, July 2017
Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Mensink, T., Gavves, E., Snoek, C.G.: COSTA: co-occurrence statistics for zero-shot classification. In: CVPR, pp. 2441–2448 (2014)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS, pp. 3111–3119 (2013)
Morgado, P., Vasconcelos, N.: Semantically consistent regularization for zero-shot recognition. In: CVPR, July 2017
Norouzi, M., et al.: Zero-shot learning by convex combination of semantic embeddings. In: ICLR (2014)
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: CVPR, vol. 1, no. 2, p. 4 (2017)
Rahman, S., Khan, S., Porikli, F.: A unified approach for conventional zero-shot, generalized zero-shot, and few-shot learning. IEEE Trans. Image Process. 27(11), 5652–5667 (2018)
Rahman, S., Khan, S., Porikli, F.: Zero-shot object detection: learning to simultaneously recognize and localize novel concepts. In: Asian Conference on Computer Vision (ACCV). Springer, December 2018
Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. arXiv preprint arXiv:1612.08242 (2016)
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE TPAMI 39(6), 1137–1149 (2017)
Ren, Z., Jin, H., Lin, Z., Fang, C., Yuille, A.: Multiple instance visual-semantic embedding. In: BMVC (2017)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Szegedy, C., et al.: Going deeper with convolutions. In: CVPR, pp. 1–9 (2015)
Tang, P., Wang, X., Feng, B., Liu, W.: Learning multi-instance deep discriminative patterns for image classification. IEEE TIP 26(7), 3385–3396 (2017)
Tang, P., Wang, X., Huang, Z., Bai, X., Liu, W.: Deep patch learning for weakly supervised object classification and discovery. Pattern Recogn. 71, 446–459 (2017)
Uijlings, J.R.R., van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective search for object recognition. IJCV 104(2), 154–171 (2013)
Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 dataset. Technical report, CNS-TR-2011-001, California Institute of Technology (2011)
Wang, X., Zhu, Z., Yao, C., Bai, X.: Relaxed multiple-instance SVM with application to object discovery. In: ICCV, pp. 1224–1232 (2015)
Wei, Y., et al.: HCP: a flexible CNN framework for multi-label image classification. IEEE TPAMI 38(9), 1901–1907 (2016)
Wu, J., Yu, Y., Huang, C., Yu, K.: Deep multiple instance learning for image classification and auto-annotation. In: CVPR, pp. 3460–3469, June 2015
Xian, Y., Akata, Z., Sharma, G., Nguyen, Q., Hein, M., Schiele, B.: Latent embeddings for zero-shot classification. In: CVPR, June 2016
Xian, Y., Schiele, B., Akata, Z.: Zero-shot learning - the good, the bad and the ugly. In: CVPR (2017)
Zhang, L., Xiang, T., Gong, S.: Learning a deep embedding model for zero-shot learning. In: CVPR, July 2017
Zhang, Y., Gong, B., Shah, M.: Fast zero-shot image tagging. In: CVPR, June 2016
Zhou, Y., Sun, X., Liu, D., Zha, Z., Zeng, W.: Adaptive pooling in multi-instance learning for web video annotation. In: ICCV, October 2017
Zitnick, C.L., Dollár, P.: Edge boxes: locating object proposals from edges. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 391–405. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_26
© 2019 Springer Nature Switzerland AG
Cite this paper
Rahman, S., Khan, S. (2019). Deep Multiple Instance Learning for Zero-Shot Image Tagging. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science(), vol 11361. Springer, Cham. https://doi.org/10.1007/978-3-030-20887-5_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20886-8
Online ISBN: 978-3-030-20887-5