Composition of Ensembles of Recurrent Neural Networks for Phishing Websites Detection

  • Conference paper
Databases and Information Systems (DB&IS 2020)

Abstract

Phishing remains a continual security threat, causing global losses exceeding 3.5 billion USD in 2019, according to the FBI’s Internet Crime Complaint Center. The Anti-Phishing Working Group (APWG) reported as many as 2,172 unique phishing websites detected per day in 2019. Most methods proposed by the scientific community for detecting phishing websites apply classical classification algorithms to phishing datasets with hand-extracted features. Although these methods achieve high accuracy, they are sensitive to a changing environment: phishers can learn which URL features are most relevant and adapt their attacks to evade the security check. Therefore, in search of less sensitive methods, researchers have begun to employ deep neural networks, which do not require manual feature extraction and can learn a representation directly from the URL’s sequence of characters. The purpose of this research is to propose a new method for detecting phishing website URLs based on ensembles of recurrent neural networks and other types of deep neural networks. The results of our approach are presented in this paper and compared with the performance of other recurrent neural networks. They are additionally compared with the performance of classical classification algorithms on the same dataset with 48 extracted features. Our method, which uses no manually extracted features, gives a significant increase in classification accuracy compared with single recurrent neural networks and matches the accuracy of classical classification ensembles that use manually extracted features.
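To make the two ideas in the abstract concrete, the following is a minimal sketch of (a) turning a URL into a fixed-length sequence of character indices, with no hand-extracted features, and (b) combining several classifiers into an ensemble. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the ensemble members are composed, so simple averaging of predicted probabilities (soft voting) is assumed here, and the names `build_vocab`, `encode_url`, and `soft_vote` are hypothetical. Each "model" is any callable that maps an encoded URL to a phishing probability; in practice these would be trained recurrent networks.

```python
from typing import Callable, Dict, List, Sequence, Tuple

PAD = 0  # index reserved for padding short URLs
UNK = 1  # index reserved for characters not seen when building the vocabulary


def build_vocab(urls: Sequence[str]) -> Dict[str, int]:
    """Map every character seen in the URLs to an integer index (starting at 2)."""
    chars = sorted({ch for url in urls for ch in url})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode_url(url: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Truncate or right-pad a URL to max_len character indices."""
    ids = [vocab.get(ch, UNK) for ch in url[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def soft_vote(
    models: Sequence[Callable[[List[int]], float]],
    encoded: List[int],
    threshold: float = 0.5,
) -> Tuple[float, bool]:
    """Average the members' phishing probabilities (assumed composition rule)."""
    probs = [model(encoded) for model in models]
    mean = sum(probs) / len(probs)
    return mean, mean >= threshold
```

The encoded sequence is what a character-level recurrent network would consume, typically through an embedding layer. Swapping `soft_vote` for majority voting or a stacked meta-classifier changes only the composition step, not the encoding.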

Notes

  1. https://www.phishtank.com/

  2. https://developers.google.com/safe-browsing/

  3. As described in https://www.tensorflow.org/api_docs/python/tf/keras/layers/SimpleRNNCell

  4. https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout

  5. http://rasbt.github.io/mlxtend/

  6. https://data.mendeley.com/datasets/h3cgnj8hft/1

  7. https://www.tensorflow.org

  8. https://scikit-learn.org/

Corresponding author

Correspondence to Paulius Vaitkevicius.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Vaitkevicius, P., Marcinkevicius, V. (2020). Composition of Ensembles of Recurrent Neural Networks for Phishing Websites Detection. In: Robal, T., Haav, HM., Penjam, J., Matulevičius, R. (eds) Databases and Information Systems. DB&IS 2020. Communications in Computer and Information Science, vol 1243. Springer, Cham. https://doi.org/10.1007/978-3-030-57672-1_22

  • DOI: https://doi.org/10.1007/978-3-030-57672-1_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-57671-4

  • Online ISBN: 978-3-030-57672-1

  • eBook Packages: Computer Science (R0)
