
Learning from PhotoShop Operation Videos: The PSOV Dataset

  • Conference paper
  • First Online:
Computer Vision – ACCV 2018 (ACCV 2018)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11364)


Abstract

In this paper, we present the PhotoShop Operation Video (PSOV) dataset, a large-scale, densely annotated video database designed for the development of software intelligence. The PSOV dataset consists of 564 densely annotated videos of Photoshop operations, covering more than 500 commonly used commands in the Photoshop software. The videos are collected from YouTube and manually annotated by experts with second-level precision, yielding more than 74 hours of video with 29,204 labeled commands. To the best of our knowledge, the PSOV dataset is the first large-scale software operation video database with high-resolution frames and dense annotations. We believe this dataset can help advance the development of intelligent software and has a wide range of potential applications. In this paper, we describe the dataset construction procedure, data attributes, proposed tasks, and their corresponding evaluation metrics. To demonstrate that the PSOV dataset provides sufficient data and labeling for data-driven methods, we develop a deep-learning-based algorithm for the command classification task. We also carry out experiments and analysis with the proposed method to encourage better understanding and usage of the PSOV dataset.

J. Cheng and H.-K. Hsu—Equal contribution.
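
To make the annotation scheme concrete, below is a minimal sketch of how a per-command label with second-level timestamps might be represented and turned into a video tube (see note 5 under Notes). The record fields and the helper function are hypothetical illustrations, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical annotation record; the real PSOV schema may differ.
@dataclass
class CommandAnnotation:
    video_id: str    # identifier of the source YouTube video
    command: str     # one of the 500+ annotated Photoshop commands
    start_sec: int   # annotations are precise to seconds
    end_sec: int

def clip_frames(fps: float, ann: CommandAnnotation) -> range:
    """Return the frame indices of the video tube covered by one labeled command."""
    return range(int(ann.start_sec * fps), int(ann.end_sec * fps))

# Example: a command labeled from 0:42 to 0:47 in a 30 fps video.
ann = CommandAnnotation("abc123", "Filter > Blur > Gaussian Blur", 42, 47)
print(len(clip_frames(30.0, ann)), "frames at 30 fps")  # 150 frames
```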


Notes

  1. www.youtube.com.

  2. developers.google.com/youtube/v3/.

  3. www.upwork.com.

  4. www.expressjs.com.

  5. Video tube denotes a sequence of video frames that contains one specific command.

  6. R denotes recall, and N denotes the number of proposals averaged over the number of ground-truth commands (see the metric sketch after these notes).
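
To make the metric in note 6 concrete, below is a minimal sketch of computing one recall-versus-average-proposal-count point, assuming a proposal counts as correct when its temporal interval overlaps a ground-truth command above an IoU threshold. The IoU matching rule and all names here are assumptions for illustration; the paper's exact matching criterion is not reproduced on this page.

```python
def temporal_iou(a, b):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_n(gt, proposals, iou_thresh=0.5):
    """R: fraction of ground-truth commands matched by any proposal.
    N: number of proposals averaged over the ground-truth commands."""
    matched = sum(
        any(temporal_iou(g, p) >= iou_thresh for p in proposals) for g in gt
    )
    return matched / len(gt), len(proposals) / len(gt)

# Two ground-truth commands, three proposals: one hit, one miss.
gt = [(10, 15), (40, 47)]
props = [(9, 16), (100, 110), (200, 205)]
print(recall_at_n(gt, props))  # (0.5, 1.5)
```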



Acknowledgement

This work is supported in part by the NSF CAREER Grant #1149783 and gifts from Adobe.

Author information

Corresponding author

Correspondence to Shengjin Wang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1906 KB)


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Cheng, J., Hsu, H.-K., Fang, C., Jin, H., Wang, S., Yang, M.-H. (2019). Learning from PhotoShop Operation Videos: The PSOV Dataset. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science, vol 11364. Springer, Cham. https://doi.org/10.1007/978-3-030-20870-7_14

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-20870-7_14

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20869-1

  • Online ISBN: 978-3-030-20870-7

  • eBook Packages: Computer Science, Computer Science (R0)
