Composition of Ensembles of Recurrent Neural Networks for Phishing Websites Detection

Vaitkevicius, Paulius; Marcinkevicius, Virginijus

doi:10.1007/978-3-030-57672-1_22

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1243)

Included in the following conference series:

International Baltic Conference on Databases and Information Systems


Abstract

Phishing remains a persistent security threat, causing global losses exceeding 3.5 billion USD in 2019, according to the FBI’s Internet Crime Complaint Center. The Anti-Phishing Working Group (APWG) reported as many as 2,172 unique phishing websites detected per day in 2019. Most methods proposed by the scientific community for detecting phishing websites apply classical classification algorithms to phishing datasets with hand-extracted features. Although these methods achieve high accuracy, they are sensitive to a changing environment: phishers can learn the most relevant URL features and adapt their attacks to evade the security check. Therefore, in search of less sensitive methods, researchers have begun to employ deep neural networks, which do not require manual feature extraction and can learn a representation directly from the URL’s sequence of characters. The purpose of this research is to propose a new method for detecting phishing website URLs based on ensembles of recurrent neural networks and other types of deep neural networks. This paper presents the results of our approach and compares them with the performance of other recurrent neural networks. The results are additionally compared with the performance of classical classification algorithms on the same dataset with 48 extracted features. Our method, which uses no manually extracted features, gives a significant increase in classification accuracy compared with single recurrent neural networks and matches the accuracy of classical classification ensembles with manually extracted features.
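As a rough illustration of the two ideas the abstract names, the sketch below shows how a URL might be turned into a sequence of character indices for a recurrent network, and how the outputs of several networks might be combined into one decision. It is not the authors' implementation: the abstract does not specify the encoding or the combination rule, so the helper names (`build_vocab`, `encode_url`, `soft_vote`), the padding scheme, and the use of probability averaging (soft voting) are all assumptions made for this example.

```python
from typing import Dict, List, Sequence

PAD = 0  # index reserved for padding (assumed convention)
UNK = 1  # index reserved for characters missing from the vocabulary


def build_vocab(urls: Sequence[str]) -> Dict[str, int]:
    """Map each distinct character in the training URLs to an index >= 2."""
    chars = sorted({ch for url in urls for ch in url})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode_url(url: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Truncate or right-pad a URL to max_len character indices."""
    ids = [vocab.get(ch, UNK) for ch in url[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def soft_vote(probabilities: Sequence[float], threshold: float = 0.5) -> bool:
    """Average the members' phishing probabilities and compare to a threshold.

    Soft voting is an assumed combination rule, chosen only for illustration.
    """
    if not probabilities:
        raise ValueError("need at least one model output")
    return sum(probabilities) / len(probabilities) >= threshold
```

In use, each ensemble member would receive the same encoded sequence and return a probability that the URL is phishing; `soft_vote` then flags the URL when the mean probability reaches the threshold.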



Author information

Authors and Affiliations

Vilnius University Institute of Data Science and Digital Technologies, Vilnius, Lithuania
Paulius Vaitkevicius & Virginijus Marcinkevicius

Authors

Paulius Vaitkevicius
Virginijus Marcinkevicius

Corresponding author

Correspondence to Paulius Vaitkevicius.

Editor information

Editors and Affiliations

Department of Computer Systems, Tallinn University of Technology, Tallinn, Estonia
Tarmo Robal
Department of Software Science, Tallinn University of Technology, Tallinn, Estonia
Hele-Mai Haav
Department of Software Science, Tallinn University of Technology, Tallinn, Estonia
Jaan Penjam
Institute of Computer Science, University of Tartu, Tartu, Estonia
Raimundas Matulevičius

About this paper

Cite this paper

Vaitkevicius, P., Marcinkevicius, V. (2020). Composition of Ensembles of Recurrent Neural Networks for Phishing Websites Detection. In: Robal, T., Haav, HM., Penjam, J., Matulevičius, R. (eds) Databases and Information Systems. DB&IS 2020. Communications in Computer and Information Science, vol 1243. Springer, Cham. https://doi.org/10.1007/978-3-030-57672-1_22

DOI: https://doi.org/10.1007/978-3-030-57672-1_22
Published: 12 August 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-57671-4
Online ISBN: 978-3-030-57672-1
eBook Packages: Computer Science, Computer Science (R0)
