
Automatic extraction of named entities of cyber threats using a deep Bi-LSTM-CRF network

  • Original Article
  • Published:
International Journal of Machine Learning and Cybernetics

Abstract

Countless cyber threat intelligence (CTI) reports are used by companies around the world on a daily basis for security purposes. To secure critical cybersecurity information, analysts and individuals must accordingly analyze information on threats and vulnerabilities. However, analyzing such overwhelming volumes of reports requires considerable time and effort. In this study, we propose a novel approach that automatically extracts core information from CTI reports using a named entity recognition (NER) system. In constructing the proposed NER system, we defined meaningful keywords in the security domain as entities, including malware, domain/URL, IP address, hash, and Common Vulnerabilities and Exposures, and linked these keywords with the words extracted from the text of the reports. To achieve higher performance, we used character-level feature vectors as input to a bidirectional long short-term memory network with a conditional random field (Bi-LSTM-CRF). We achieved an average F1-score of 75.05%. We also release the dataset of 498,000 tags created during our research.


Notes

  1. Details of these hyper-parameters are provided in "Appendix 4".

References

  1. Alzaidy R, Caragea C, Giles CL (2019) Bi-LSTM-CRF sequence labeling for keyphrase extraction from scholarly documents. In: The World Wide Web conference. ACM, pp 2551–2557

  2. Bengio Y, Simard P, Frasconi P et al (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 5(2):157–166


  3. Bridges RA, Jones CL, Iannacone MD, Testa KM, Goodall JR (2013) Automatic labeling for entity extraction in cyber security. arXiv preprint arXiv:1308.4941

  4. Pham TH, Le-Hong P (2018) End-to-end recurrent neural network models for Vietnamese named entity recognition: word-level vs. character-level. In: Computational linguistics: 15th international conference of the Pacific Association for Computational Linguistics, PACLING 2017, Yangon, Myanmar, 16–18 Aug 2017, revised selected papers, vol 781. Springer, p 219

  5. Chismon D, Ruks M (2015) Threat intelligence: collecting, analysing, evaluating. MWR InfoSecurity Ltd, London


  6. Chiu JP, Nichols E (2016) Named entity recognition with bidirectional LSTM-CNNs. Trans Assoc Comput Linguist 4:357–370


  7. Conti M, Dargahi T, Dehghantanha A (2018) Cyber threat intelligence: challenges and opportunities. Springer, Berlin


  8. Corbett P, Boyle J (2018) Chemlistem: chemical named entity recognition using recurrent neural networks. J Cheminform 10(1):59


  9. Elman JL (1990) Finding structure in time. Cogn Sci 14(2):179–211


  10. Gasmi H, Bouras A, Laval J (2018) LSTM recurrent neural networks for cybersecurity named entity recognition. ICSEA 2018:11


  11. Gasmi H, Laval J, Bouras A (2019) Information extraction of cybersecurity concepts: an LSTM approach. Appl Sci 9(19):3945


  12. Goldberg Y (2016) A primer on neural network models for natural language processing. J Artif Intell Res 57:345–420


  13. Gordon MS (2018) Economic and national security effects of cyber attacks against small business communities. PhD thesis, Utica College

  14. Graves A (2012) Supervised sequence labelling. In: Supervised sequence labelling with recurrent neural networks. Springer, pp 5–13

  15. Graves A, Mohamed AR, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, pp 6645–6649

  16. Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw 18(5–6):602–610


  17. Habibi M, Weber L, Neves M, Wiegandt DL, Leser U (2017) Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 33(14):i37–i48


  18. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780


  19. Hu Z, Ma X, Liu Z, Hovy E, Xing E (2016) Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318

  20. Huang Z, Xu W, Yu K (2015) Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991

  21. Yang JL, Jayakumar A (2014) Target says up to 70 million more customers were hit by December data breach. Washington Post, 10 Jan 2014

  22. Joshi A, Lal R, Finin T, Joshi A (2013) Extracting cybersecurity related linked data from text. In: 2013 IEEE seventh international conference on semantic computing. IEEE, pp 252–259

  23. Kiss T, Strunk J (2006) Unsupervised multilingual sentence boundary detection. Comput Linguist 32(4):485–525


  24. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436


  25. Lee C, Kim YB, Lee D, Lim H (2018) Character-level feature extraction with densely connected networks. arXiv preprint arXiv:1806.09089

  26. McCallum A, Freitag D, Pereira FC (2000) Maximum entropy Markov models for information extraction and segmentation. In: ICML, vol 17, pp 591–598

  27. Mikolov T, Karafiát M, Burget L, Černocký J, Khudanpur S (2010) Recurrent neural network based language model. In: Eleventh annual conference of the International Speech Communication Association

  28. More S, Matthews M, Joshi A, Finin T (2012) A knowledge-based approach to intrusion detection modeling. In: 2012 IEEE symposium on security and privacy workshops. IEEE, pp 75–81

  29. Mulwad V, Li W, Joshi A, Finin T, Viswanathan K (2011) Extracting information about security vulnerabilities from web text. In: Proceedings of the 2011 IEEE/WIC/ACM international conferences on web intelligence and intelligent agent technology, vol 03. IEEE Computer Society, pp 257–260

  30. Nunes E, Diab A, Gunn A, Marin E, Mishra V, Paliath V, Robertson J, Shakarian J, Thart A, Shakarian P (2016) Darknet and deepnet mining for proactive cybersecurity threat intelligence. In: 2016 IEEE conference on intelligence and security informatics (ISI). IEEE, pp 7–12

  31. Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543

  32. Ratnaparkhi A (1996) A maximum entropy model for part-of-speech tagging. In: Conference on empirical methods in natural language processing

  33. Reimers N, Gurevych I (2017) Reporting score distributions makes a difference: performance study of LSTM-networks for sequence tagging. arXiv preprint arXiv:1707.09861

  34. Robertson J, Diab A, Marin E, Nunes E, Paliath V, Shakarian J, Shakarian P (2016) Darknet mining and game theory for enhanced cyber threat intelligence. Cyber Def Rev 1(2):95–122


  35. Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117


  36. Wu F, Liu J, Wu C, Huang Y, Xie X (2019) Neural Chinese named entity recognition via CNN-LSTM-CRF and joint training with word segmentation. In: The World Wide Web conference. ACM, pp 3342–3348

  37. Yadav V, Bethard S (2019) A survey on recent advances in named entity recognition from deep learning models. arXiv preprint arXiv:1910.11470

  38. Zhou K, Zhang S, Meng X, Luo Q, Wang Y, Ding K, Feng Y, Chen M, Cohen K, Xia J (2018) CRF-LSTM text mining method unveiling the pharmacological mechanism of off-target side effect of anti-multiple Myeloma drugs. In: Proceedings of the BioNLP 2018 workshop, pp 166–171


Funding

Funding was provided by the Korea Creative Content Agency (Grant No. R2017030045).

Author information


Corresponding author

Correspondence to Heuiseok Lim.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Text extraction

1.1 Unnecessary content

PDF documents contain unnecessary content such as page numbers, annotations, and watermarks, as well as incorrect sentence endings, all of which act as noise during text extraction and make it difficult to identify the running text. Such content should therefore be eliminated by converting the unstructured format into an HTML document, which provides useful information such as the font, font size, and line breaks (<br> tags), among other features (Table 6). During this process, noise elimination works best when less than 5% of the strings of the entire document are eliminated. We also delete repetitive strings whose normalized frequency exceeds a threshold of 0.5.

Table 6 Initial text extracted from a PDF document as an HTML file
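The repetitive-string filter can be realized with a simple frequency count. The following is a minimal sketch, assuming the document has already been split into per-page lists of lines; the 0.5 threshold follows the text, while normalizing by the page count is our assumption.

```python
from collections import Counter

def drop_repetitive_lines(pages, threshold=0.5):
    """Remove lines (e.g., running headers, footers, watermarks) that
    repeat on more than `threshold` of the pages."""
    counts = Counter(line.strip() for page in pages for line in page)
    n_pages = len(pages)
    return [
        [line for line in page if counts[line.strip()] / n_pages <= threshold]
        for page in pages
    ]
```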

Discrepancies among character sets are another issue. For example, Unicode provides roughly 20 different characters for quotation marks (“ ”). This causes errors when applying rule-based and deep learning approaches. We therefore map characters with the same meaning to the same representation (Appendix 2): a post-processing step replaces semantically identical or visually similar characters with plain keyboard characters. Although noise elimination using HTML offers useful separators, reconstructing complete sentences remains difficult. Therefore, a method is needed for determining whether two lines are consecutive, which is difficult to detect from the HTML code alone.
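As an illustration, such a mapping can be applied with a translation table. The entries below are a small excerpt chosen by us; the full character correspondence appears in Appendix 2 (Table 8).

```python
# Excerpt of the character-correspondence table (see Table 8);
# the selection shown here is illustrative.
CHAR_MAP = {
    "\u201c": '"', "\u201d": '"',   # curly double quotation marks
    "\u2018": "'", "\u2019": "'",   # curly single quotation marks
    "\u2013": "-", "\u2014": "-",   # en dash and em dash
}

def normalize_chars(text: str) -> str:
    """Replace semantically identical Unicode variants with
    plain keyboard characters."""
    return text.translate(str.maketrans(CHAR_MAP))
```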

1.2 Text line conjunction

Even after noise elimination, deciding where lines join is ambiguous. For example, a position containing a large number of newline tags (<br>) may indicate either the beginning of a new paragraph or a continuation of the sentence on the previous line.

Hyphens (-) also cause ambiguity. Some English words genuinely contain hyphens; however, a long word at the end of a line may also be split in two with a hyphen when there is not enough space to place it as a single word. In addition, a period (.) cannot be relied on as an indicator of a sentence ending, because various acronyms and abbreviations use the mark, e.g., a postscript (P.S.). In particular, the dataset used herein includes numerous periods (.) inside URLs and IP addresses, as well as subtitles without any period, which increases the ambiguity when analyzing HTML tags. These problems can be solved by predicting the original word based on word-occurrence frequencies in a preprocessed corpus; we use Wikipedia (the English version) as the reference corpus. We take several types of words into consideration, as follows:

  1. Word positioning between the previous and following lines, hyphenated or not.

  2. Words with or without a spatial gap when calculating the word frequency.

  3. Words applying the most frequently occurring word combination.

If every word combination has zero frequency, the two consecutive lines are simply joined. However, if the character size recorded in the HTML tags does not match that of the next line, the lines are regarded as discontinuous and no joining is performed.
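A minimal sketch of the hyphen case of this heuristic follows; `word_freq` is assumed to be a word-frequency dictionary built from the English Wikipedia reference corpus, and the function name and return convention are ours, not the authors'.

```python
def join_hyphenated(prev_line, next_line, word_freq):
    """Join two consecutive lines, deciding whether a line-final hyphen
    is a line-break artifact or part of a genuinely hyphenated word."""
    head, tail = prev_line.rstrip(), next_line.lstrip()
    if not head.endswith("-") or not tail:
        return head + " " + tail                  # plain continuation
    stem = head.split()[-1][:-1]                  # drop the trailing hyphen
    first, *rest = tail.split()
    merged = stem + first                         # e.g., "antivirus"
    hyphenated = stem + "-" + first               # e.g., "anti-virus"
    # Keep whichever form occurs more often in the reference corpus;
    # if both have zero frequency, the lines are simply joined.
    word = max((merged, hyphenated), key=lambda w: word_freq.get(w.lower(), 0))
    return " ".join(head.split()[:-1] + [word] + rest)
```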

Owing to the nature of intelligence reports, the text contains an excessive number of periods that defeat existing sentence boundary detection models. This limitation led us to develop a customized sentence boundary recognizer for intelligence reports that combines an unsupervised learning model with regular expressions. Specifically, a rule-based step replaces the irrelevant periods in URLs and IP addresses with special tokens, and a period-based sentence boundary detection model is then applied to identify the correct sentences. However, some irrelevant periods cannot be removed by this step, and an insufficient amount of report data is available for training; we therefore train the unsupervised model of Kiss and Strunk [23] on Wikipedia (English version) as the reference corpus, which would not be possible with the report data alone.
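A minimal sketch of this pipeline follows, using NLTK's implementation of the unsupervised Punkt model [23]; the regular expressions, placeholder tokens, and file name are illustrative rather than the authors' exact patterns.

```python
import re
from nltk.tokenize.punkt import PunktSentenceTokenizer

URL_RE = re.compile(r"https?://\S+")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def mask_iocs(text: str) -> str:
    """Replace URLs and IP addresses with special tokens so their
    internal periods do not trigger false sentence boundaries."""
    return IP_RE.sub("IPTOKEN", URL_RE.sub("URLTOKEN", text))

# Train the unsupervised boundary model on the reference corpus
# (English Wikipedia text, assumed available as a plain-text file).
wiki_text = open("wiki_corpus.txt", encoding="utf-8").read()
tokenizer = PunktSentenceTokenizer(train_text=wiki_text)

report = "The C2 server 192.168.0.1 hosts http://evil.example/a.php. It drops malware."
sentences = tokenizer.tokenize(mask_iocs(report))
```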

To minimize the risk of misreading the first word of a sentence as part of a URL, we collect the first word of each sentence and its frequency from Wikipedia (English version) and separate the word if it appears frequently (more than 5000 times) at the beginning of a sentence. In addition, we improve performance through tokenization, not only applying word-spacing but also separating characters such as parentheses, periods, and quotation marks. The word tokenizer of Python's Natural Language Toolkit is modified appropriately for the target corpus in our study.

After tokenization is completed, we restore the original strings in place of the special tokens introduced for noise elimination and reunite inappropriately separated special characters, e.g., URLs. Other treatments, including word-spacing, are also executed. Table 7 shows the optimized text obtained from the PDF documents:

Table 7 Data representation of extracted information

Appendix 2: Unicode corresponding characters

See Table 8.

Table 8 Unicode code points for the characters collected and analyzed in this study

Appendix 3: Key 20 entities in cybersecurity

See Table 9.

Table 9 Class definition of data categories

Appendix 4: Hyper-parameter settings

See Table 10.

Table 10 Hyper-parameters in BOC and Bi-LSTM-CRF model

The hyper-parameters used in our study are shown above. We adopt the default values used by Lee et al. [25], and specify 200 additional hidden dimensions and a dropout rate of 0.9 because the BOC features are deployed as additional inputs to the pre- and post-Bi-LSTM-CRF models. In both the forward and backward directions of the LSTM, we set the hidden feature dimension to 300.
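For concreteness, the encoder portion of such a configuration might look as follows in PyTorch. This is a sketch reflecting only the stated hyper-parameters; the CRF layer and the BOC feature extractor are omitted, and the word-embedding dimension is an illustrative assumption.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Bi-LSTM encoder with hidden size 300 per direction (Table 10);
    character-level (BOC) features are concatenated to word embeddings."""
    def __init__(self, word_dim=100, boc_dim=200, hidden_dim=300, dropout=0.9):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(
            input_size=word_dim + boc_dim,   # word + character-level features
            hidden_size=hidden_dim,          # 300 in each direction
            bidirectional=True,
            batch_first=True,
        )

    def forward(self, word_emb, boc_feat):
        x = torch.cat([word_emb, boc_feat], dim=-1)
        out, _ = self.lstm(self.dropout(x))
        return out  # (batch, seq_len, 600), fed to a CRF output layer
```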


About this article


Cite this article

Kim, G., Lee, C., Jo, J. et al. Automatic extraction of named entities of cyber threats using a deep Bi-LSTM-CRF network. Int. J. Mach. Learn. & Cyber. 11, 2341–2355 (2020). https://doi.org/10.1007/s13042-020-01122-6

