MSDetector: A Static PHP Webshell Detection System Based on Deep-Learning

  • Conference paper
Theoretical Aspects of Software Engineering (TASE 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13299)

Abstract

A webshell is a web script containing malicious code fragments, which attackers can use to launch web attacks. Hence, identifying whether a web script contains malicious code fragments is of great significance to web security. However, the flexibility of scripting languages such as PHP gives attackers opportunities to obfuscate scripts, making it challenging for traditional rule-based webshell detectors to detect malicious code fragments. Deep learning brings new ideas to webshell detection and improves detector performance. However, the performance of deep learning-based detectors depends on feature engineering and on the deep learning models used. The feature representations and models adopted by existing methods fail to mine the syntactic and semantic features of webshell scripts. To tackle these problems, we design a new code representation, called script sequence, according to the characteristics of webshells, and we introduce a new pre-training task to improve the deep learning model's understanding of the syntax information of webshell code. This leads to the design and implementation of Malicious Script Detector (MSDetector). To evaluate MSDetector, we present a new PHP webshell dataset. Experimental results show that MSDetector achieves a higher F1 score and accuracy than other approaches on the dataset.
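
The abstract compares detectors by F1 score and accuracy. As a reminder of what those two metrics measure for a binary webshell/benign classifier, here is a minimal Python sketch. The function name `accuracy_and_f1` and the label lists are hypothetical and invented for illustration only; they are not taken from the paper, its dataset, or the MSDetector code.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, where 1 = webshell and 0 = benign."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # webshells flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign scripts flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # webshells missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign scripts passed

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0

    # F1 is the harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical labels, not results from the paper.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

Accuracy counts every correct decision equally, while F1 ignores true negatives, so F1 is the more informative of the two when benign scripts far outnumber webshells.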

Supported by the National Natural Science Foundation of China (No.: 61873069).

Notes

  1. https://github.com/CiscoCXSecurity/NeoPI

  2. http://www.shelldetector.com

  3. https://github.com/for-just-we/MSDetector

  4. https://www.antlr.org/

  5. https://github.com/JetBrains-Research/astminer

  6. https://joern.readthedocs.io/en/latest/

Author information

Corresponding author

Correspondence to Guosheng Xu.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Cheng, B., Guo, Y., Ren, Y., Yang, G., Xu, G. (2022). MSDetector: A Static PHP Webshell Detection System Based on Deep-Learning. In: Aït-Ameur, Y., Crăciun, F. (eds) Theoretical Aspects of Software Engineering. TASE 2022. Lecture Notes in Computer Science, vol 13299. Springer, Cham. https://doi.org/10.1007/978-3-031-10363-6_11

  • DOI: https://doi.org/10.1007/978-3-031-10363-6_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-10362-9

  • Online ISBN: 978-3-031-10363-6

  • eBook Packages: Computer Science, Computer Science (R0)
