A Fast Algorithm for Constructing Phylogenetic Trees with Application to IoT Malware Clustering

  • Conference paper
Neural Information Processing (ICONIP 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11953)

Abstract

To handle thousands of malware specimens efficiently, we aim to categorize them into malware families quickly and automatically. One solution is the neighbor-joining method with NCD (Normalized Compression Distance) as the measure of similarity between malware specimens: it builds a phylogenetic tree of the malware from the NCDs between malware binaries and uses the tree for clustering. However, this approach is frustratingly slow because computing the NCDs requires \((N^2+N)/2\) compression attempts, where \(N\) is the number of given specimens. For fast clustering, this paper presents an algorithm that constructs a phylogenetic tree efficiently by greatly reducing the number of compression attempts. The key idea is not to construct a tree of all \(N\) specimens at once. Instead, the algorithm divides the \(N\) specimens into temporal clusters in advance, constructs a small tree for each temporal cluster, and joins the small trees into a single united tree. Intuitively, constructing small trees separately requires far fewer than \((N^2+N)/2\) compression attempts. In experiments with 4,109 in-the-wild malware specimens, our algorithm clustered 22 times faster than the neighbor-joining method while achieving a good accuracy of 97%.
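The compression count in the abstract can be illustrated with a minimal sketch. It assumes the standard NCD definition, \((C(xy) - \min(C(x), C(y))) / \max(C(x), C(y))\), and uses zlib as a stand-in compressor; the abstract does not say which compressor the paper uses. The function names and the cluster sizes in the usage note are hypothetical, and the per-cluster count ignores whatever compressions are needed to form the clusters and to join the small trees, so it is only a rough illustration of the saving, not the paper's algorithm.

```python
import zlib
from typing import Iterable


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (one compression attempt)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings.

    Uses three compression attempts: C(x), C(y), and C(xy).
    zlib is an assumption here, not necessarily the paper's compressor.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def attempts_all_pairs(n: int) -> int:
    """Compression attempts for a full NCD matrix over n specimens.

    n single compressions plus n*(n-1)/2 concatenated pairs,
    which sums to (n^2 + n) / 2.
    """
    return (n * n + n) // 2


def attempts_within_clusters(sizes: Iterable[int]) -> int:
    """Attempts if a full NCD matrix is computed inside each cluster only.

    Ignores the cost of forming clusters and of joining the small trees.
    """
    return sum(attempts_all_pairs(s) for s in sizes)
```

For the 4,109 specimens mentioned in the abstract, `attempts_all_pairs(4109)` gives 8,443,995 compression attempts. With a hypothetical split into four clusters of 1,000 specimens and one of 109, `attempts_within_clusters([1000, 1000, 1000, 1000, 109])` gives 2,007,995, which shows why building small trees separately is cheaper.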

Notes

  1. A preliminary version of this algorithm was proposed by Takumi Yone, who was supervised by some of the authors of this paper, in a master’s thesis [19] at Kyushu University.

References

  1. Malwr. https://malwr.com/

  2. Virustotal. https://www.virustotal.com/

  3. Antonakakis, M., et al.: Understanding the Mirai botnet. In: Proceedings of the 26th USENIX Conference on Security Symposium, SEC 2017, pp. 1093–1110. USENIX Association, Berkeley (2017)

  4. Bailey, M., Oberheide, J., Andersen, J., Mao, Z.M., Jahanian, F., Nazario, J.: Automated classification and analysis of internet malware. In: Kruegel, C., Lippmann, R., Clark, A. (eds.) RAID 2007. LNCS, vol. 4637, pp. 178–197. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74320-0_10

  5. Bayer, U., Comparetti, P.M., Hlauschek, C., Krügel, C., Kirda, E.: Scalable, behavior-based malware clustering. In: Proceedings of the Network and Distributed System Security Symposium, NDSS 2009, San Diego, pp. 8–11 (2009)

  6. Black Lotus Labs: Attack of things! https://www.netformation.com/our-pov/attack-of-things-2/

  7. Cebrian, M., Alfonseca, M., Ortega, A.: The normalized compression distance is resistant to noise. IEEE Trans. Inf. Theory 53(5), 1895–1900 (2007)

  8. Cilibrasi, R., Vitanyi, P.M.B.: Clustering by compression. IEEE Trans. Inf. Theory 51(4), 1523–1545 (2005)

  9. Doctor Web: Dr.Web. https://www.drweb.com

  10. Elias, I., Lagergren, J.: Fast neighbor joining. In: Caires, L., Italiano, G.F., Monteiro, L., Palamidessi, C., Yung, M. (eds.) ICALP 2005. LNCS, vol. 3580, pp. 1263–1274. Springer, Heidelberg (2005). https://doi.org/10.1007/11523468_102

  11. Karim, M.E., Walenstein, A., Lakhotia, A., Parida, L.: Malware phylogeny generation using permutations of code. J. Comput. Virol. 1(1–2), 13–23 (2005)

  12. Langfelder, P., Zhang, B., Horvath, S.: Defining clusters from a hierarchical cluster tree: the dynamic tree cut package for R. Bioinformatics 24(5), 719–720 (2007)

  13. Li, M., Chen, X., Li, X., Ma, B., Vitányi, P.M.B.: The similarity metric. IEEE Trans. Inf. Theory 50(12), 3250–3264 (2004)

  14. Pa, Y.M.P., Suzuki, S., Yoshioka, K., Matsumoto, T., Kasama, T., Rossow, C.: IoTPOT: analysing the rise of IoT compromises. In: 9th USENIX Workshop on Offensive Technologies, WOOT 2015. USENIX Association, Washington, D.C. (2015)

  15. Price, M.N., Dehal, P.S., Arkin, A.P.: FastTree 2 - approximately maximum-likelihood trees for large alignments. PLOS ONE 5(3), 1–10 (2010)

  16. Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4(4), 406–425 (1987)

  17. Salomon, D.: Data Compression - The Complete Reference, 4th edn. Springer, London (2007). https://doi.org/10.1007/978-1-84628-603-2

  18. Simonsen, M., Mailund, T., Pedersen, C.N.S.: Rapid neighbour-joining. In: Crandall, K.A., Lagergren, J. (eds.) WABI 2008. LNCS, vol. 5251, pp. 113–122. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87361-7_10

  19. Yone, T.: Phylogenetic tree estimation for large-scale malware datasets. Master’s thesis. Kyushu University, Japan (2016). (in Japanese)

Acknowledgment

The authors wish to thank the IoTPOT team from Yokohama National University for providing the dataset. This research was partially supported by JSPS KAKENHI Grant Number 18H03291.

Author information

Corresponding author

Correspondence to Tianxiang He.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

He, T. et al. (2019). A Fast Algorithm for Constructing Phylogenetic Trees with Application to IoT Malware Clustering. In: Gedeon, T., Wong, K., Lee, M. (eds) Neural Information Processing. ICONIP 2019. Lecture Notes in Computer Science, vol 11953. Springer, Cham. https://doi.org/10.1007/978-3-030-36708-4_63

  • DOI: https://doi.org/10.1007/978-3-030-36708-4_63

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-36707-7

  • Online ISBN: 978-3-030-36708-4

  • eBook Packages: Computer Science, Computer Science (R0)
