Abstract
Recent studies show that aggregating local descriptors into a super vector yields an effective representation for retrieval and classification tasks. A popular method along this line is the vector of locally aggregated descriptors (VLAD), which aggregates the residuals between descriptors and visual words. However, the original VLAD ignores high-order statistics of local descriptors, and its dictionary may not be optimal for classification tasks. In this paper, we address these problems by utilizing high-order statistics of local descriptors and performing supervised dictionary learning. The main contributions are twofold. First, we propose a high-order VLAD (H-VLAD) for visual recognition, which leverages two kinds of high-order statistics in the VLAD-like framework, namely diagonal covariance and skewness. These high-order statistics provide information complementary to VLAD and can be computed efficiently. Second, to further boost the performance of H-VLAD, we design a supervised dictionary learning algorithm that discriminatively refines the dictionary and can also be extended to other super-vector-based encoding methods. We examine the effectiveness of our methods in image-based object categorization and video-based action recognition. Extensive experiments on the PASCAL VOC 2007, HMDB51, and UCF101 datasets show that our method achieves state-of-the-art performance on both tasks.
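To make the idea of adding diagonal covariance and skewness to a VLAD-like encoding concrete, the following is a minimal NumPy sketch. It is not the authors' formulation: the abstract does not give the exact equations, so the hard assignment to the nearest visual word, the use of raw per-cell variance and standardized skewness, the absence of any normalization, and the function name `hvlad_sketch` are all assumptions made for illustration.

```python
import numpy as np

def hvlad_sketch(X, centers, eps=1e-8):
    """Illustrative high-order VLAD-like encoding (hypothetical helper).

    X:       (N, D) array of local descriptors.
    centers: (K, D) array of visual words (the dictionary).
    Returns a vector of length 3 * K * D: first-order residual sums,
    per-dimension variances, and per-dimension skewness for each cell.
    """
    K, D = centers.shape
    # Assumption: hard-assign each descriptor to its nearest visual word.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)

    first = np.zeros((K, D))   # classic VLAD part
    second = np.zeros((K, D))  # diagonal covariance part
    third = np.zeros((K, D))   # skewness part
    for k in range(K):
        Xk = X[assign == k]
        if len(Xk) == 0:
            continue  # empty cells contribute zeros
        # First order: sum of residuals to the visual word.
        first[k] = (Xk - centers[k]).sum(axis=0)
        # Second order: per-dimension variance of the descriptors in the cell.
        mu = Xk.mean(axis=0)
        var = ((Xk - mu) ** 2).mean(axis=0)
        second[k] = var
        # Third order: standardized third central moment per dimension.
        third[k] = ((Xk - mu) ** 3).mean(axis=0) / (var + eps) ** 1.5

    return np.concatenate([first.ravel(), second.ravel(), third.ravel()])

# Toy usage: three 1-D descriptors and a single visual word.
X = np.array([[0.0], [0.0], [3.0]])
centers = np.array([[1.0]])
v = hvlad_sketch(X, centers)
```

In this toy case the residuals cancel, so the first-order part is zero, while the variance and skewness parts are nonzero. This illustrates the sense in which higher-order statistics can carry information that the residual sum alone does not.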
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Peng, X., Wang, L., Qiao, Y., Peng, Q. (2014). Boosting VLAD with Supervised Dictionary Learning and High-Order Statistics. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8691. Springer, Cham. https://doi.org/10.1007/978-3-319-10578-9_43
DOI: https://doi.org/10.1007/978-3-319-10578-9_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10577-2
Online ISBN: 978-3-319-10578-9
eBook Packages: Computer Science, Computer Science (R0)