Skip to main content
Log in

Video content categorization using the double decomposition

  • Published:
Multimedia Tools and Applications Aims and scope Submit manuscript

Abstract

Video contents contain complex structures due to the variety of the components and events involved. For example, surveillance videos often record multi-object interactions and consist of various scales of motion detail; Web videos are composed of multimodal cues, and each cue generally consists of a variety of scales of information. Generally, video contents comprise two types of the combination of the inherent structures: multi-modality/multi-scale and multi-object /multi-scale. Therefore, in this paper, we propose a new framework for video content modeling, under which video contents are decomposed into multiple interacting processes by double decomposition that aims at each type of combination of structures. To model the resulting processes, we propose a method named double-decomposed hidden Markov models (DDHMMs). DDHMMs contain multiple state chains that correspond to the interacting processes. To make the switching frequency of states in each chain consistent with the scale of the corresponding process, a durational state variable is introduced in DDHMMs. The proposed method performs well in modeling the relations among the interacting processes and the dynamics of each. We discuss the appropriate features under the proposed framework and evaluate DDHMMs in two applications, human motion recognition and web video categorization. The experimental results demonstrate that the double decomposition enhances video categorization performance in both cases.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12

Similar content being viewed by others

References

  1. Brand M, Oliver N, Pentland A (1997) Coupled hidden Markov models for complex action recognition. In: Proceedings of IEEE conference on computer vision and pattern recognition, pp 994–999

  2. Brezeale D, Cook DJ (2008) Automatic video classification: a survey of the literature. IEEE Trans Syst Man Cybern C 38:416–430

    Article  Google Scholar 

  3. Chen C, Liang J, Zhu X (2011) Gait recognition based on improved dynamic Bayesian networks. Pattern Recogn 44:988–995

    Article  Google Scholar 

  4. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Proceedings of IEEE conference on computer vision and pattern recognition, pp 886–893

  5. Duong TV, Bui HH, Phung DQ, Venkatesh S (2005) Activity recognition and abnormality detection with the switching hidden semi-Markov model. In: Proceedings of IEEE conference on computer vision and pattern recognition, pp 838–845

  6. Fine S, Singer Y, Tishby N (1998) The hierarchical hidden Markov model: analysis and applications. Mach Learn 32:41–62

    Article  MATH  Google Scholar 

  7. Forney GD (1973) The Viterbi algorithm. P IEEE 61:268–278

    Article  MathSciNet  Google Scholar 

  8. Ghahramani Z, Jordan MI (1997) Factorial hidden Markov models. Mach Learn 29:245–273

    Article  MATH  Google Scholar 

  9. Gu J, Ding X, Wang S, Wu Y (2010) Action and gait recognition from recovered 3-D human joints. IEEE Trans Syst Man Cybern B 40:1021–1033

    Article  Google Scholar 

  10. Huang CL, Shih HC, Chao CY (2006) Semantic analysis of soccer video using dynamic Bayesian network. IEEE Trans Multimedia 8:749–760

    Article  Google Scholar 

  11. Junejo IN (2010) Using dynamic Bayesian network for scene modeling and anomaly detection. Signal Image Video P 4:1–10

    Article  MATH  Google Scholar 

  12. Liu X, Chua CS (2006) Multi-agent activity recognition using observation decomposed hidden Markov models. Image Vis Comput 24:166–175

    Article  MATH  Google Scholar 

  13. Liu Y, Wu F (2009) Multi-modality video shot clustering with tensor representation. Multimed Tools Appl 41(1):93–109

    Article  Google Scholar 

  14. Manohar V, Tsakalidis S, Natarajan P, et al (2011) Audio-visual fusion using bayesian model combination for web video retrieval. In: Proceddings of ACM conference on multimedia, pp 1537–1540

  15. Mitchell C, Harper M, Jamieson L (1999) On the complexity of explicit duration HMMs. IEEE Trans Speech Audio Process 3(3):213–217

    Article  Google Scholar 

  16. Murphy KP (2002) Dynamic Bayesian network: representation, inference and learning. Ph.D Thesis, University of California, Berkeley

  17. Natarajan P, Nevatia R (2007) Coupled hidden semi-Markov models for activity recognition. In: Proceedings of IEEE workshop on motion and video computing, pp 10–17

  18. Nefian AV, Liang L, Pi X, et al (2002) A coupled HMM for audio-visual speech recognition. In: Proceedings of ICASSP, pp 2013–2016

  19. Niebles JC, Chen C, Li F (2010) Modeling temporal structure of decomposable motion segments for activity classification. In: Proceddings of ECCV, pp 392–405

  20. Oliver N, Garg A, Horvitz E (2004) Layered representations for learning and inferring office activity from multiple sensory channels. Comput Vis Image Underst 96(2):163–180

    Article  Google Scholar 

  21. Roach MJ, Mason JSD, Pawlewski M (2001) Video genre classification using dynamics. In: Proceedings of ICASSP, pp 1557–1560

  22. Roweis S, Saul L (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290:2323–2326

    Article  Google Scholar 

  23. Snoek CGM, Worring M, Smeulders AWM (2005) Early versus late fusion in semantic video analysis. In: Proceedings of ACM international conference on multimedia, pp 399–402

  24. Tan BT, Fu M, Spray A, Dermody P (1996) The use of wavelet transforms in phoneme recognition. In: Proceedings of international conference on spoken language, pp 2431–2434

  25. Wang M, Hua X, Yuan X, Song Y, et al (2007) Optimizing multi-graph learning: towards a unified video annotation scheme. In: Proceedings of ACM international conference on multimedia, pp 862–871

  26. Wang L, Zhou H, Low S, Leckie C (2009) Action recognition via multi-feature fusion and gaussian process classification. In: Proceedings of workshop on applications of computer vision, pp 1–6

  27. Wu Y, Chang EY, Chang KCC, Smith JR (2004) Optimal multimodal fusion for multimedia data analysis. In: Proceedings of ACM international conference on multimedia, pp 572–579

  28. Yamato J, Ohya J, Ishii K (1992) Recognizing human action in time-sequential images using Hidden markov model. In: Proceedings of IEEE conference on computer vision and pattern recognition, pp 379–385

Download references

Acknowledgements

The research presented in this paper is supported in part by the National Natural Science Foundation (60905018, 60903121, 61173109, 61175039), Key Projects in the National Science & Technology Pillar Program (2011BAK08B02), Research Fund for Doctoral Program of Higher Education (20090201120032), Fundamental Research Funds for the Central Universities (xjj2009041, xjj20100051), of China. The authors would like to thank the video team at United Technologies Research Center (UTRC) for their pertinent and constructive discussion, and thank Dr. K.P. Murphy for his Matlab Bnet toolbox. Also, the authors would like to thank all the anonymous reviewers for their constructive advices.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Xueming Qian.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Du, Y., Chen, F., Xu, W. et al. Video content categorization using the double decomposition. Multimed Tools Appl 66, 545–572 (2013). https://doi.org/10.1007/s11042-012-1213-y

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11042-012-1213-y

Keywords

Navigation