Abstract
This paper addresses the robustness of information fusion for visual recognition. By analyzing the limitations of existing fusion methods, we identify two key factors that affect the performance and robustness of a fusion model under different data distributions, namely (1) data dependency and (2) the fusion assumption on the posterior distribution. Considering these two factors, we develop a new framework that models dependency based on probabilistic properties of posteriors without any assumption on the data distribution. Making use of the range characteristics of posteriors, the fusion model is formulated as an analytic function multiplied by a constant with respect to the class label. With the analytic fusion model, we give a condition equivalent to the independence assumption and derive the dependency model from the marginal distribution property. Since the number of terms in the dependency model increases exponentially, the Reduced Analytic Dependency Model (RADM) is proposed based on the convergence property of analytic functions. Finally, the optimal coefficients in the RADM are learned by incorporating label information from training data to minimize the empirical classification error under a regularized least-squares criterion, which ensures discriminative power. Experimental results from robust non-parametric statistical tests show that the proposed RADM method statistically significantly outperforms eight state-of-the-art score-level fusion methods on eight image/video datasets for different tasks of digit, flower, face, human action, object, and consumer video recognition.
Notes
Note that significance in this paper refers to statistical significance, not the degree of improvement. In statistics, a result is called statistically significant if the difference observed in an experiment is unlikely to have arisen by chance alone and is likely to be the result of a genuine experimental effect (Sheskin 2011).
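To make the notion of statistical significance concrete, the following minimal sketch computes the Friedman rank statistic (Friedman 1937; Demšar 2006), a common non-parametric test for comparing several methods over several datasets. Whether this is the exact procedure used in the paper's experiments is an assumption, and the accuracy values, method count, and function names below are hypothetical, chosen only for illustration.

```python
def average_ranks(row):
    """Rank one dataset's scores (rank 1 = highest score); ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with the score at position i.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """Return (mean rank per method, Friedman chi-square) for scores[dataset][method].

    No tie correction is applied to the statistic in this sketch.
    """
    n = len(scores)       # number of datasets
    k = len(scores[0])    # number of methods
    rank_rows = [average_ranks(row) for row in scores]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return mean_ranks, chi2


# Hypothetical accuracies: 4 datasets (rows) x 3 fusion methods (columns).
scores = [
    [0.91, 0.88, 0.85],
    [0.76, 0.74, 0.70],
    [0.83, 0.80, 0.79],
    [0.95, 0.93, 0.90],
]
mean_ranks, chi2 = friedman_statistic(scores)
print(mean_ranks, chi2)
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom (critical value 5.991 for k = 3 at the 0.05 level). A value above the threshold indicates that the rank differences are unlikely to be due to chance alone, which is a separate question from how large the accuracy gap is.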
References
Ahonen, T., Hadid, A., & Pietikäinen, M. (2004). Face recognition with local binary patterns. European Conference on Computer Vision, Lecture Notes in Computer Science, 3021, 469–481.
Awais, M., Yan, F., Mikolajczyk, K., & Kittler, J. (2011). Augmented kernel matrix vs classifier fusion for object recognition. British Machine Vision Conference, 60.1–60.11.
Belhumeur, P. N., Hespanha, J. P., & Kriegman, D. J. (1997). Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 711–720.
Breukelen, M., Duin, R., Tax, D., & Hartog, J. (1998). Handwritten digit recognition by combined classifiers. Kybernetika, 34(4), 381–386.
Canu, S., Grandvalet, Y., Guigue, V., & Rakotomamonjy, A. (2005). SVM and kernel methods matlab toolbox. Rouen: Perception Systèmes et Information, INSA de Rouen.
Chen, H., & Meer, P. (2005). Robust fusion of uncertain information. IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, 35(3), 578–586.
Comaniciu, D. (2003). Robust information fusion using variable-bandwidth density estimation. International Conference of Information Fusion, 2, 1303–1309.
Cover, T. M., & Thomas, J. A. (2006). Elements of information theory. New York: Wiley.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. IEEE Conference on Computer Vision and Pattern Recognition, 1, 886–893.
Dass, S. C., Nandakumar, K., & Jain, A. K. (2005). A principled approach to score level fusion in multimodal biometric systems. In International conference on audio- and video-based biometric person authentication (pp. 1049–1058).
Demiriz, A., Bennett, K. P., & Shawe-Taylor, J. (2002). Linear programming boosting via column generation. Machine Learning, 46(1–3), 225–254.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56, 52–64.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2007). The PASCAL visual object classes challenge 2007 (VOC 2007) results.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Feller, W. (1968). An introduction to probability theory and its applications, volume I. New York: Wiley.
Fernando, B., Fromont, E., Muselet, D., & Sebban, M. (2012). Discriminative feature fusion for image classification. In IEEE conference on computer vision and pattern recognition (pp. 3434–3441).
Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32, 675–701.
Gehler, P., & Nowozin, S. (2009). On feature combination for multiclass object classification. In IEEE international conference on computer vision (pp. 221–228).
Gorelick, L., Blank, M., Shechtman, E., Irani, M., & Basri, R. (2007). Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12), 2247–2253.
Guillaumin, M., Verbeek, J., & Schmid, C. (2010). Multimodal semi-supervised learning for image classification. In IEEE conference on computer vision and pattern recognition (pp. 902–909).
He, H., & Cao, Y. (2012). SSC: A classifier combination method based on signal strength. IEEE Transactions on Neural Networks and Learning Systems, 23(7), 1100–1117.
He, M., Horng, S.-J., Fan, P., Run, R.-S., Chen, R.-J., Lai, J.-L., et al. (2010). Performance evaluation of score level fusion in multimodal biometric systems. Pattern Recognition, 43(5), 1789–1800.
He, X., Yan, S., Hu, Y., Niyogi, P., & Zhang, H. J. (2005). Face recognition using laplacianfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3), 328–340.
Huber, P. J., & Ronchetti, E. M. (2009). Robust statistics (2nd ed.). New York: Wiley.
Jain, A., Nandakumar, K., & Ross, A. (2005). Score normalization in multimodal biometric systems. Pattern Recognition, 38(12), 2270–2285.
Jiang, Y.-G., Ye, G., Chang, S.-F., Ellis, D., & Loui, A. C. (2011). Consumer video understanding: A benchmark database and an evaluation of human and machine performance. ACM International Conference on Multimedia Retrieval, 29:1–29:8.
Kittler, J., Hatef, M., Duin, R. P. W., & Matas, J. (1998). On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 226–239.
Krantz, S. G., & Parks, H. R. (2002). A primer of real analytic functions. Basel: Birkhäuser.
Kuncheva, L. I. (2004). Combining pattern classifiers: Methods and algorithms. New York: Wiley.
Lan, X., Yuen, P. C., & Ma, A. J. (2014). Multi-cue visual tracking using robust feature-level fusion based on joint sparse representation. In IEEE conference on computer vision and pattern recognition.
Liu, D., Lai, K.-T., Ye, G., Chen, M.-S., & Chang, S.-F. (2013). Sample specific late fusion for visual category recognition. In IEEE conference on computer vision and pattern recognition (pp. 803–810).
Liu, J., McCloskey, S., & Liu, Y. (2012). Local expert forest of score fusion for video event classification. European Conference on Computer Vision, Lecture Notes in Computer Science, 7576, 397–410.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Luenberger, D. G., & Ye, Y. (2008). Linear and nonlinear programming (3rd ed.). Berlin: Springer.
Ma, A. J., & Yuen, P. C. (2012). Reduced analytical dependency modeling for classifier fusion. European Conference on Computer Vision, Lecture Notes in Computer Science, 7574, 792–805.
Ma, A. J., Yuen, P. C., & Lai, J.-H. (2013a). Linear dependency modeling for classifier fusion and feature combination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(5), 1135–1148.
Ma, A. J., Yuen, P. C., Zou, W. W., & Lai, J.-H. (2013b). Supervised spatio-temporal neighborhood topology learning for action recognition. IEEE Transactions on Circuits and Systems for Video Technology, 23(8), 1447–1460.
Mikolajczyk, K., & Schmid, C. (2004). Scale & affine invariant interest point detectors. International Journal of Computer Vision, 60(1), 63–86.
Mittal, A., Zisserman, A., & Torr, P. (2011). Hand detection using multiple proposals. British Machine Vision Conference, 75.1–75.11.
Nandakumar, K., Chen, Y., Dass, S. C., & Jain, A. K. (2008). Likelihood ratio based biometric score fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2), 342–347.
Natarajan, P., Wu, S., Vitaladevuni, S., Zhuang, X., Tsakalidis, S., Park, U., et al. (2012). Multimodal feature fusion for robust event detection in web videos. In IEEE conference on computer vision and pattern recognition (pp. 1298–1305).
Nilsback, M.-E., & Zisserman, A. (2006). A visual vocabulary for flower classification. IEEE Conference on Computer Vision and Pattern Recognition, 2, 1447–1454.
Nilsback, M.-E., & Zisserman, A. (2008). Automated flower classification over a large number of classes. In IEEE Indian conference on computer vision, graphics and image processing (pp. 722–729).
Oh, S., McCloskey, S., Kim, I., Vahdat, A., Cannons, K., Hajimirsadeghi, H., et al. (2014). Multimedia event detection with multimodal feature fusion and temporal concept localization. Machine Vision and Applications, 25(1), 49–69.
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.
Phillips, P. J., Moon, H., Rizvi, S. A., & Rauss, P. J. (2000). The FERET evaluation methodology for face recognition algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10), 1090–1104.
Prabhakar, S., & Jain, A. K. (2002). Decision-level fusion in fingerprint verification. Pattern Recognition, 35(4), 861–874.
Ross, A., Nandakumar, K., & Jain, A. K. (2006). Handbook of multibiometrics. Berlin: Springer.
Rudin, W. (1976). Principles of mathematical analysis. New York: McGraw-Hill.
Scheirer, W., Rocha, A., Micheals, R., & Boult, T. (2010). Robust fusion: Extreme value theory for recognition score normalization. European Conference on Computer Vision, Lecture Notes in Computer Science, 6313, 481–495.
Schuldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: A local SVM approach. IEEE International Conference on Pattern Recognition, 3, 32–36.
Sheskin, D. J. (2011). Handbook of parametric and nonparametric statistical procedures (5th ed.). London: Chapman and Hall/CRC.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Sim, T., Baker, S., & Bsat, M. (2003). The CMU pose, illumination, and expression database. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12), 1615–1618.
Tang, K., Yao, B., Fei-Fei, L., & Koller, D. (2013). Combining the right features for complex event recognition. In IEEE international conference on computer vision.
Terrades, O. R., Valveny, E., & Tabbone, S. (2009). Optimal classifier fusion in a non-bayesian probabilistic framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(9), 1630–1644.
Toh, K.-A., Tran, Q.-L., & Srinivasan, D. (2004a). Benchmarking a reduced multivariate polynomial pattern classifier. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6), 740–755.
Toh, K.-A., Yau, W.-Y., & Jiang, X. (2004b). A reduced multivariate polynomial model for multimodal biometrics and classifiers fusion. IEEE Transactions on Circuits and Systems for Video Technology, 14(2), 224–233.
Ueda, N. (2000). Optimal linear combination of neural networks for improving classification performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(2), 207–215.
Wang, H., Nie, F., & Huang, H. (2013). Heterogeneous visual features fusion via sparse multimodal machine. In IEEE conference on computer vision and pattern recognition (pp. 3097–3102).
Wang, J., Kwon, S., & Shim, B. (2012). Generalized orthogonal matching pursuit. IEEE Transactions on Signal Processing, 60(12), 6202–6216.
Ye, G., Liu, D., Jhuo, I.-H., & Chang, S.-F. (2012). Robust late fusion with rank minimization. In IEEE conference on computer vision and pattern recognition (pp. 3021–3028).
Yuan, X.-T., Liu, X., & Yan, S. (2012). Visual classification with multitask joint sparse representation. IEEE Transactions on Image Processing, 21(10), 4349–4360.
Zhang, J., Marszałek, M., Lazebnik, S., & Schmid, C. (2007). Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2), 213–238.
Acknowledgments
This project was partially supported by the Science Faculty Research Grant of Hong Kong Baptist University, Hong Kong Research Grants Council General Research Fund 212313, and National Science Foundation of China Research Grant 61172136. The authors would like to thank the editor and reviewers for their helpful comments, which improved the quality of this paper.
Additional information
Communicated by K. Ikeuchi.
Appendices
Appendix A: Proof of Proposition 1
We first show that conditional independence implies that the solution to equation system (16) is trivial, i.e. \({\varvec{a}}_{lm0} = \mathbf 0 , {\varvec{a}}_{lm2} = \mathbf 0 , {\varvec{a}}_{lm3} = \mathbf 0 , \ldots \) solves equation system (16) for \(m = 1, \ldots , M\). If the feature representations are independent of each other given the class label \(\omega _l\), the analytic function \(h_l({\varvec{s}}_l; {\varvec{a}}_l)\) becomes Eq. (3). Rewriting the analytic function in (3) according to the order of \(s_{lm}\), we get
\[ h_l({\varvec{s}}_l; {\varvec{a}}_l) = g_{lm1}(\tilde{{\varvec{s}}}_{lm}; {\varvec{a}}_{lm1}) \, s_{lm}, \qquad (36) \]
where \(g_{lm1}(\tilde{{\varvec{s}}}_{lm}; {\varvec{a}}_{lm1}) = p_l^{1-M} \prod _{m' \ne m} s_{lm'}\). Eq. (36) means that \(g_{lmn}(\tilde{{\varvec{s}}}_{lm}; {\varvec{a}}_{lmn}) \equiv 0\), or equivalently \({\varvec{a}}_{lmn} = \mathbf 0 \), for \(n \ne 1\), i.e. the solution to equation system (16) is trivial.
Conversely, given that the solution to equation system (16) is trivial, we need to show that the analytic function \(h_l({\varvec{s}}_l; {\varvec{a}}_l)\) is equal to Eq. (3). If \({\varvec{a}}_{lmn} = \mathbf 0 \) for \(n \ne 1\), then the analytic function \(h_l({\varvec{s}}_l; {\varvec{a}}_l)\) can be rewritten as Eq. (36) for \(m = 1, \ldots , M\). This implies that each term in the power series \(h_l\) contains all variables \(s_{l1}, \ldots , s_{lM}\) and that the order of each \(s_{lm}\) cannot be larger than one. In this case, there is only one non-zero term, \(\prod _{m=1}^{M} s_{lm}\), in the analytic function \(h_l\). In addition, according to the normalization equation (15), the non-zero term \(\prod _{m=1}^{M} s_{lm}\) is normalized by the prior, so the analytic function becomes Eq. (3). This completes the proof of the proposition.
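As a numerical sanity check of the independence case treated in this proof, the following minimal sketch assumes that Eq. (3) is the product rule \(p_l^{1-M} \prod _{m=1}^{M} s_{lm}\), which is what the form of \(g_{lm1}\) in this appendix suggests. It compares the normalized product-rule fusion with the exact joint posterior on a toy discrete model whose features are conditionally independent given the class. The priors, likelihood tables, and function names are hypothetical and serve only as an illustration.

```python
from math import prod


def single_posteriors(priors, likelihoods, x):
    """s[l][m] = P(class l | x_m), computed by Bayes' rule for each feature alone."""
    num_classes, num_features = len(priors), len(x)
    s = [[0.0] * num_features for _ in range(num_classes)]
    for m in range(num_features):
        evidence = sum(priors[k] * likelihoods[k][m][x[m]] for k in range(num_classes))
        for l in range(num_classes):
            s[l][m] = priors[l] * likelihoods[l][m][x[m]] / evidence
    return s


def product_rule(priors, s):
    """Fuse per-feature posteriors as p_l^(1-M) * prod_m s_lm, normalized over classes."""
    num_features = len(s[0])
    raw = [priors[l] ** (1 - num_features) * prod(s[l]) for l in range(len(priors))]
    total = sum(raw)
    return [r / total for r in raw]


def joint_posterior(priors, likelihoods, x):
    """Exact P(class l | x_1..x_M) when features are conditionally independent."""
    raw = [
        priors[l] * prod(likelihoods[l][m][x[m]] for m in range(len(x)))
        for l in range(len(priors))
    ]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical toy model: 3 classes, 2 binary features.
# likelihoods[l][m][v] = P(x_m = v | class l)
priors = [0.5, 0.3, 0.2]
likelihoods = [
    [[0.7, 0.3], [0.2, 0.8]],
    [[0.4, 0.6], [0.5, 0.5]],
    [[0.1, 0.9], [0.9, 0.1]],
]
x = (1, 0)  # observed feature values

s = single_posteriors(priors, likelihoods, x)
fused = product_rule(priors, s)
exact = joint_posterior(priors, likelihoods, x)
print(fused)
print(exact)
```

Before normalization, the product-rule value and the joint posterior differ only by the factor \(\prod _m P(x_m) / P(x_1, \ldots , x_M)\), which does not depend on the class label, so the two normalized vectors agree under conditional independence.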
Appendix B: Derivation for \(E_{\mathrm{Dis}}({\varvec{a}}, {\varvec{q}})\)
Appendix C: Derivation of the Matrix Formulation for \(E({\varvec{a}})\)
Cite this article
Ma, A.J., Yuen, P.C. Reduced Analytic Dependency Modeling: Robust Fusion for Visual Recognition. Int J Comput Vis 109, 233–251 (2014). https://doi.org/10.1007/s11263-014-0723-7
DOI: https://doi.org/10.1007/s11263-014-0723-7