
Learning from PhotoShop Operation Videos: The PSOV Dataset

  • Conference paper
  • First Online:
Computer Vision – ACCV 2018 (ACCV 2018)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 11364)


Abstract

In this paper, we present the PhotoShop Operation Video (PSOV) dataset, a large-scale, densely annotated video database designed for the development of software intelligence. The PSOV dataset consists of 564 densely annotated videos of Photoshop operations, covering more than 500 commonly used commands in the Photoshop software. The videos are collected from YouTube and manually annotated by experts with second-level precision, yielding more than 74 hours of video with 29,204 labeled commands. To the best of our knowledge, the PSOV dataset is the first large-scale software operation video database with high-resolution frames and dense annotations. We believe this dataset can help advance the development of intelligent software and has a wide range of potential applications. In this paper, we describe the dataset construction procedure, data attributes, proposed tasks, and their corresponding evaluation metrics. To demonstrate that the PSOV dataset provides sufficient data and labeling for data-driven methods, we develop a deep-learning-based algorithm for the command classification task. We also carry out experiments and analysis with the proposed method to encourage better understanding and usage of the PSOV dataset.

J. Cheng and H.-K. Hsu—Equal contribution.
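
To make the annotation scheme concrete, below is a minimal sketch of how a per-command label with second-level timestamps might be represented and turned into a video tube (see note 5 under Notes). The record fields and the helper function are hypothetical illustrations, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical annotation record; the real PSOV schema may differ.
@dataclass
class CommandAnnotation:
    video_id: str    # identifier of the source YouTube video
    command: str     # one of the 500+ annotated Photoshop commands
    start_sec: int   # annotations are precise to seconds
    end_sec: int

def clip_frames(fps: float, ann: CommandAnnotation) -> range:
    """Return the frame indices of the video tube covered by one labeled command."""
    return range(int(ann.start_sec * fps), int(ann.end_sec * fps))

# Example: a command labeled from 0:42 to 0:47 in a 30 fps video.
ann = CommandAnnotation("abc123", "Filter > Blur > Gaussian Blur", 42, 47)
print(len(clip_frames(30.0, ann)), "frames at 30 fps")  # 150 frames
```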


Notes

  1. www.youtube.com.

  2. developers.google.com/youtube/v3/.

  3. www.upwork.com.

  4. www.expressjs.com.

  5. Video tube denotes a sequence of video frames that contains one specific command.

  6. R denotes recall, and N denotes the number of proposals averaged over the number of ground-truth commands (see the metric sketch after these notes).
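
To make the metric in note 6 concrete, below is a minimal sketch of computing one recall-versus-average-proposal-count point, assuming a proposal counts as correct when its temporal interval overlaps a ground-truth command above an IoU threshold. The IoU matching rule and all names here are assumptions for illustration; the paper's exact matching criterion is not reproduced on this page.

```python
def temporal_iou(a, b):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_n(gt, proposals, iou_thresh=0.5):
    """R: fraction of ground-truth commands matched by any proposal.
    N: number of proposals averaged over the ground-truth commands."""
    matched = sum(
        any(temporal_iou(g, p) >= iou_thresh for p in proposals) for g in gt
    )
    return matched / len(gt), len(proposals) / len(gt)

# Two ground-truth commands, three proposals: one hit, one miss.
gt = [(10, 15), (40, 47)]
props = [(9, 16), (100, 110), (200, 205)]
print(recall_at_n(gt, props))  # (0.5, 1.5)
```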



Acknowledgement

This work is supported in part by the NSF CAREER Grant #1149783 and gifts from Adobe.

Author information

Corresponding author

Correspondence to Shengjin Wang.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1906 KB)


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Cheng, J., Hsu, H.-K., Fang, C., Jin, H., Wang, S., Yang, M.-H. (2019). Learning from PhotoShop Operation Videos: The PSOV Dataset. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science, vol 11364. Springer, Cham. https://doi.org/10.1007/978-3-030-20870-7_14

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-20870-7_14

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20869-1

  • Online ISBN: 978-3-030-20870-7

  • eBook Packages: Computer Science, Computer Science (R0)
