MSDetector: A Static PHP Webshell Detection System Based on Deep-Learning

Cheng, Baijun; Guo, Yanhui; Ren, Yan; Yang, Gang; Xu, Guosheng

doi:10.1007/978-3-031-10363-6_11

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13299)

Included in the following conference series:

International Symposium on Theoretical Aspects of Software Engineering

Abstract

A webshell is a web script containing malicious code fragments that attackers can use to launch web attacks. Identifying whether a web script contains malicious code fragments is therefore of great significance for web security. However, the flexibility of scripting languages such as PHP gives attackers the opportunity to obfuscate scripts, making it challenging for traditional rule-based webshell detectors to detect malicious code fragments. Deep learning brings new ideas to webshell detection and improves detector performance, but the performance of deep learning-based detectors depends on both feature engineering and the models used. The feature representations and models adopted by existing methods fail to mine the syntactic and semantic features of webshell scripts. To tackle these problems, we design a new code representation, called script sequence, according to the characteristics of webshells, and we introduce a new pre-training task to enhance the deep learning model's understanding of the syntax information in webshell code. This leads to the design and implementation of Malicious Script Detector (MSDetector). To evaluate MSDetector, we present a new PHP webshell dataset. Experimental results show that MSDetector achieves a higher F1 score and accuracy than other approaches on the dataset.

Supported by the National Natural Science Foundation of China (No.: 61873069).

Author information

Authors and Affiliations

School of Cyberspace Security, National Engineering Research Center of Mobile Network Security, Beijing University of Posts and Telecommunications, Beijing, 100876, China
Baijun Cheng, Yanhui Guo & Guosheng Xu
QI-ANXIN Technology Group Inc., Beijing, China
Yan Ren
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, 100876, China
Gang Yang

Corresponding author

Correspondence to Guosheng Xu.

Editor information

Editors and Affiliations

IRIT, Toulouse, France
Yamine Aït-Ameur
Babeș-Bolyai University, Cluj-Napoca, Romania
Florin Crăciun

About this paper

Cite this paper

Cheng, B., Guo, Y., Ren, Y., Yang, G., Xu, G. (2022). MSDetector: A Static PHP Webshell Detection System Based on Deep-Learning. In: Aït-Ameur, Y., Crăciun, F. (eds) Theoretical Aspects of Software Engineering. TASE 2022. Lecture Notes in Computer Science, vol 13299. Springer, Cham. https://doi.org/10.1007/978-3-031-10363-6_11

DOI: https://doi.org/10.1007/978-3-031-10363-6_11
Published: 03 July 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10362-9
Online ISBN: 978-3-031-10363-6
eBook Packages: Computer Science, Computer Science (R0)
