Attributes and Action Recognition Based on Convolutional Neural Networks and Spatial Pyramid VLAD Encoding

Yan, Shiyang; Smith, Jeremy S.; Zhang, Bailing

doi:10.1007/978-3-319-54526-4_37

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10118)

Included in the following conference series: Asian Conference on Computer Vision

Abstract

Determining human attributes and recognizing actions in still images are two related and challenging computer vision tasks, often posed in fine-grained domains where the distinctions between categories are very small. Deep Convolutional Neural Network (CNN) models have demonstrated remarkable representation learning capability across many tasks. However, their success in attribute and action recognition has been limited, as the potential of CNNs to capture both the global and the local information of an image remains largely unexplored. This paper tackles the problem by encoding CNN features with a spatial pyramid Vector of Locally Aggregated Descriptors (VLAD). With region proposals generated by Edgeboxes, the encoding yields a compact and efficient image representation for the subsequent prediction of attributes and classification of actions. The proposed scheme is validated with competitive results on two benchmark datasets: 90.4% mean Average Precision (mAP) on the Berkeley Attributes of People dataset and 88.5% mAP on the Stanford 40 action dataset.
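The encoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each region proposal has already been turned into a CNN feature vector, that a codebook of cluster centers has been learned beforehand (for example by k-means), that a region is assigned to the pyramid cell containing its box center, and that the VLAD vector is normalized with a signed square root followed by L2 normalization. The function names, the two-level pyramid, and the random stand-in data are all hypothetical.

```python
import numpy as np


def vlad_encode(descriptors, centers):
    """Encode (n, d) descriptors against a (k, d) codebook as a k*d VLAD vector."""
    k, d = centers.shape
    vlad = np.zeros((k, d))
    if len(descriptors) > 0:
        # Hard-assign every descriptor to its nearest codebook center.
        dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        for j in range(k):
            members = descriptors[nearest == j]
            if len(members) > 0:
                # Accumulate the residuals to the assigned center.
                vlad[j] = (members - centers[j]).sum(axis=0)
    vlad = vlad.ravel()
    # Assumed normalization: signed square root, then L2.
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad


def spatial_pyramid_vlad(descriptors, box_centers, image_size, centers, levels=(1, 2)):
    """Concatenate one VLAD vector per pyramid cell.

    A region falls into the cell that contains its box center (an assumption).
    """
    width, height = image_size
    parts = []
    for g in levels:
        # Grid column and row of each region at this level, clamped to the last cell.
        col = np.minimum((box_centers[:, 0] / width * g).astype(int), g - 1)
        row = np.minimum((box_centers[:, 1] / height * g).astype(int), g - 1)
        for r in range(g):
            for c in range(g):
                mask = (row == r) & (col == c)
                parts.append(vlad_encode(descriptors[mask], centers))
    return np.concatenate(parts)


# Random stand-ins for 50 region features (d = 8) and a 4-word codebook.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(50, 8))
box_centers = rng.uniform(0, [640, 480], size=(50, 2))
centers = rng.normal(size=(4, 8))

encoding = spatial_pyramid_vlad(descriptors, box_centers, (640, 480), centers)
print(encoding.shape)  # (160,) = (1 + 4) cells * 4 centers * 8 dimensions
```

The resulting vector has a fixed length regardless of how many regions are proposed, which is what allows it to be passed to a standard classifier.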

Author information

Authors and Affiliations

Xi’an Jiaotong-Liverpool University, Suzhou, China
Shiyang Yan & Bailing Zhang
University of Liverpool, Liverpool, UK
Jeremy S. Smith

Corresponding author

Correspondence to Shiyang Yan.

Editor information

Editors and Affiliations

Institute of Information Science, Academia Sinica, Taipei, Taiwan
Chu-Song Chen
Tsinghua University, Beijing, China
Jiwen Lu
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Singapore
Kai-Kuang Ma

Cite this paper

Yan, S., Smith, J.S., Zhang, B. (2017). Attributes and Action Recognition Based on Convolutional Neural Networks and Spatial Pyramid VLAD Encoding. In: Chen, CS., Lu, J., Ma, KK. (eds) Computer Vision – ACCV 2016 Workshops. ACCV 2016. Lecture Notes in Computer Science, vol 10118. Springer, Cham. https://doi.org/10.1007/978-3-319-54526-4_37

DOI: https://doi.org/10.1007/978-3-319-54526-4_37
Published: 16 March 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54525-7
Online ISBN: 978-3-319-54526-4
eBook Packages: Computer Science, Computer Science (R0)