Abstract
Countless cyber threat intelligence (CTI) reports are used by companies around the world on a daily basis for security purposes. To secure critical cybersecurity information, analysts and individuals must accordingly analyze the information on threats and vulnerabilities contained in these reports. However, analyzing such an overwhelming volume of reports requires considerable time and effort. In this study, we propose a novel approach that automatically extracts core information from CTI reports using a named entity recognition (NER) system. In constructing the proposed NER system, we defined meaningful keywords in the security domain as entities, including malware, domain/URL, IP address, hash, and Common Vulnerabilities and Exposures (CVE), and linked these keywords with the words extracted from the text of the reports. To achieve higher performance, we fed character-level feature vectors into a bidirectional long short-term memory network with a conditional random field layer (Bi-LSTM-CRF), achieving an average F1-score of 75.05%. We also release the dataset of 498,000 tags created during our research.
Notes
Details of these hyper-parameters are provided in "Appendix 4".
References
Alzaidy R, Caragea C, Giles CL (2019) Bi-LSTM-CRF sequence labeling for keyphrase extraction from scholarly documents. In: The World Wide Web conference. ACM, pp 2551–2557
Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 5(2):157–166
Bridges RA, Jones CL, Iannacone MD, Testa KM, Goodall JR (2013) Automatic labeling for entity extraction in cyber security. arXiv preprint arXiv:1308.4941
Pham TH, Le-Hong P (2018) End-to-end recurrent neural network models for Vietnamese named entity recognition: word-level vs. character-level. In: Computational linguistics: 15th international conference of the Pacific Association for Computational Linguistics, PACLING 2017, Yangon, Myanmar, 16–18 Aug 2017, Revised Selected Papers, vol 781. Springer, p 219
Chismon D, Ruks M (2015) Threat intelligence: collecting, analysing, evaluating. MWR InfoSecurity Ltd, London
Chiu JP, Nichols E (2016) Named entity recognition with bidirectional LSTM-CNNs. Trans Assoc Comput Linguist 4:357–370
Conti M, Dargahi T, Dehghantanha A (2018) Cyber threat intelligence: challenges and opportunities. Springer, Berlin
Corbett P, Boyle J (2018) Chemlistem: chemical named entity recognition using recurrent neural networks. J Cheminform 10(1):59
Elman JL (1990) Finding structure in time. Cogn Sci 14(2):179–211
Gasmi H, Bouras A, Laval J (2018) LSTM recurrent neural networks for cybersecurity named entity recognition. ICSEA 2018:11
Gasmi H, Laval J, Bouras A (2019) Information extraction of cybersecurity concepts: an LSTM approach. Appl Sci 9(19):3945
Goldberg Y (2016) A primer on neural network models for natural language processing. J Artif Intell Res 57:345–420
Gordon MS (2018) Economic and national security effects of cyber attacks against small business communities. PhD thesis, Utica College
Graves A (2012) Supervised sequence labelling. In: Supervised sequence labelling with recurrent neural networks. Springer, pp 5–13
Graves A, Mohamed AR, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, pp 6645–6649
Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw 18(5–6):602–610
Habibi M, Weber L, Neves M, Wiegandt DL, Leser U (2017) Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 33(14):i37–i48
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Hu Z, Ma X, Liu Z, Hovy E, Xing E (2016) Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318
Huang Z, Xu W, Yu K (2015) Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991
Jayakumar A, Yang JL (2014) Target says up to 70 million more customers were hit by December data breach. Washington Post, 10 Jan 2014
Joshi A, Lal R, Finin T, Joshi A (2013) Extracting cybersecurity related linked data from text. In: 2013 IEEE seventh international conference on semantic computing. IEEE, pp 252–259
Kiss T, Strunk J (2006) Unsupervised multilingual sentence boundary detection. Comput Linguist 32(4):485–525
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436
Lee C, Kim YB, Lee D, Lim H (2018) Character-level feature extraction with densely connected networks. arXiv preprint arXiv:1806.09089
McCallum A, Freitag D, Pereira FC (2000) Maximum entropy Markov models for information extraction and segmentation. In: ICML, vol 17, pp 591–598
Mikolov T, Karafiát M, Burget L, Černockỳ J, Khudanpur S (2010) Recurrent neural network based language model. In: Eleventh annual conference of the international speech communication association
More S, Matthews M, Joshi A, Finin T (2012) A knowledge-based approach to intrusion detection modeling. In: 2012 IEEE symposium on security and privacy workshops. IEEE, pp 75–81
Mulwad V, Li W, Joshi A, Finin T, Viswanathan K (2011) Extracting information about security vulnerabilities from web text. In: Proceedings of the 2011 IEEE/WIC/ACM international conferences on web intelligence and intelligent agent technology, vol 03. IEEE Computer Society, pp 257–260
Nunes E, Diab A, Gunn A, Marin E, Mishra V, Paliath V, Robertson J, Shakarian J, Thart A, Shakarian P (2016) Darknet and deepnet mining for proactive cybersecurity threat intelligence. In: 2016 IEEE conference on intelligence and security informatics (ISI). IEEE, pp 7–12
Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
Ratnaparkhi A (1996) A maximum entropy model for part-of-speech tagging. In: Conference on empirical methods in natural language processing
Reimers N, Gurevych I (2017) Reporting score distributions makes a difference: performance study of LSTM-networks for sequence tagging. arXiv preprint arXiv:1707.09861
Robertson J, Diab A, Marin E, Nunes E, Paliath V, Shakarian J, Shakarian P (2016) Darknet mining and game theory for enhanced cyber threat intelligence. Cyber Def Rev 1(2):95–122
Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117
Wu F, Liu J, Wu C, Huang Y, Xie X (2019) Neural Chinese named entity recognition via CNN-LSTM-CRF and joint training with word segmentation. In: The World Wide Web conference. ACM, pp 3342–3348
Yadav V, Bethard S (2019) A survey on recent advances in named entity recognition from deep learning models. arXiv preprint arXiv:1910.11470
Zhou K, Zhang S, Meng X, Luo Q, Wang Y, Ding K, Feng Y, Chen M, Cohen K, Xia J (2018) CRF-LSTM text mining method unveiling the pharmacological mechanism of off-target side effect of anti-multiple myeloma drugs. In: Proceedings of the BioNLP 2018 workshop, pp 166–171
Funding
Funding was provided by the Korea Creative Content Agency (Grant No. R2017030045).
Appendices
Appendix 1: Text extraction
1.1 Unnecessary contents
PDF documents contain unnecessary information such as page numbers, annotations, and watermarks, which makes it difficult to identify the body text and introduces incorrect sentence endings that act as noise during text extraction. We therefore eliminate such information by converting the unstructured format into an HTML document, which provides useful cues such as the font, font size, and line breaks (<br> tags), among other factors (Table 6). During this process, the best noise elimination is achieved when we remove less than 5% of the strings in the entire document; we also delete repetitive strings whose normalized frequency exceeds a threshold of 0.5.
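As an illustration, the repeated-string filter can be sketched as follows. This is a minimal sketch rather than the authors' implementation: it assumes the HTML conversion yields a list of text lines and that the page count is known, and it drops lines whose normalized repetition exceeds 0.5 while capping total removal at 5% of the document's characters.

```python
from collections import Counter

def strip_repeated_lines(lines, num_pages, freq_threshold=0.5, max_removal_ratio=0.05):
    """Drop lines that recur across pages (headers, footers, watermarks),
    keeping total removal under max_removal_ratio of all characters."""
    counts = Counter(line.strip() for line in lines)
    total_chars = sum(len(line) for line in lines)
    removed_chars, kept = 0, []
    for line in lines:
        norm_freq = counts[line.strip()] / num_pages  # normalized repetition
        over_budget = removed_chars + len(line) > max_removal_ratio * total_chars
        if norm_freq > freq_threshold and not over_budget:
            removed_chars += len(line)  # treat as boilerplate noise
        else:
            kept.append(line)
    return kept
```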
Discrepancies among character sets are another issue. For example, Unicode contains some 20 different characters for quotation marks (“ ”), which causes errors when applying rule-based and deep learning approaches. We therefore map characters with the same meaning to a single representation (Appendix 2): a post-processing step replaces semantically identical or visually similar characters with their keyboard equivalents. Although noise elimination using HTML offers useful separators, reconstructing complete sentences remains difficult. Therefore, we need a method for determining whether two lines are consecutive, which is hard to detect from the HTML code alone.
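For example, this normalization can be performed with a simple translation table. The mapping below is an illustrative subset only; the full set of corresponding characters is given in Appendix 2 (Table 8).

```python
# Illustrative subset of the Unicode-to-keyboard mapping (see Table 8).
UNICODE_TO_ASCII = {
    "\u2018": "'", "\u2019": "'",   # curly single quotation marks
    "\u201c": '"', "\u201d": '"',   # curly double quotation marks
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u00a0": " ",                  # non-breaking space
}

def normalize_characters(text: str) -> str:
    """Replace semantically identical or visually similar characters
    with their keyboard equivalents."""
    return text.translate(str.maketrans(UNICODE_TO_ASCII))
```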
1.2 Text line conjunction
For example, a position where newline tags (<br>) appear may mark either the beginning of a new paragraph or a continuation of the sentence on the previous line.
Hyphens (-) also cause ambiguity. Some English words naturally contain hyphens; however, a long word at the end of a line may also be split into two parts with a hyphen when there is no space to fit it as a single word. In addition, a period (.) cannot always be treated as a sentence-ending indicator, because various acronyms and abbreviations also use it, e.g., a postscript (P.S.). In particular, the dataset used herein includes numerous periods (.) inside URLs and IP addresses, as well as subtitles that end without a period, which increases the ambiguity when analyzing HTML tags. We solve these problems by predicting the original word based on word-occurrence frequencies in a preprocessed corpus, using the English version of Wikipedia as the reference corpus. We consider the following candidate forms of a word:
1. The word spanning the previous and following lines, hyphenated or not.
2. The word with or without an internal space when calculating the word frequency.
3. The word form given by the most frequently occurring combination.
If all candidate combinations have zero frequency, the two consecutive lines are simply joined. However, if the character size in the HTML tags does not match that of the next line, no joining is performed, as the lines are treated as discontinuous. A minimal sketch of this join decision follows.
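The sketch below assumes unigram counts from the English Wikipedia are available as a dictionary; the candidate forms and the tie-breaking rule are our reading of the procedure, not the authors' code.

```python
def join_line_break(prev_last: str, next_first: str, freq: dict) -> str:
    """Decide how to merge a hyphenated line-final fragment with the first
    word of the next line, using reference-corpus unigram counts."""
    stem = prev_last.rstrip("-")
    candidates = [
        stem + next_first,          # soft line break: "net-" + "work" -> "network"
        stem + "-" + next_first,    # true hyphen: "real-" + "time" -> "real-time"
        stem + " " + next_first,    # two separate words
    ]
    # Score each candidate by the rarest word it contains; with all-zero
    # frequencies, max() keeps the first candidate, i.e., the joined form.
    return max(candidates, key=lambda c: min(freq.get(w, 0) for w in c.split()))
```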
Owing to the nature of intelligence reports, the text contains an excessive number of periods that break existing sentence boundary detection models. These limits led us to develop a customized sentence boundary recognizer for intelligence reports that combines an unsupervised learning model with regular expressions. Specifically, a rule-based process first replaces the irrelevant periods in URLs and IP addresses with special tokens, and a period-based sentence boundary detection model is then applied to identify the correct sentence boundaries. However, some irrelevant periods cannot be removed by this process, and the reports alone provide insufficient data for training; we therefore train the period-based sentence boundary detection model with the unsupervised method of Kiss and Strunk [23], using the English version of Wikipedia as the reference corpus.
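A rough sketch of this pipeline using NLTK's implementation of the Kiss and Strunk model (Punkt) is shown below. The indicator regex and placeholder scheme are simplified stand-ins for the paper's rules, and `wikipedia_text` and `report_text` are assumed to be loaded elsewhere.

```python
import re
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Simplified pattern for URLs and IPv4 addresses whose internal periods
# must not be mistaken for sentence boundaries.
IOC_RE = re.compile(r"https?://\S+|\b(?:\d{1,3}\.){3}\d{1,3}\b")

def mask_iocs(text):
    """Replace URLs and IP addresses with placeholder tokens."""
    iocs = IOC_RE.findall(text)
    for i, ioc in enumerate(iocs):
        text = text.replace(ioc, f"IOCTOKEN{i}", 1)
    return text, iocs

# Train the unsupervised Punkt model [23] on the reference corpus.
trainer = PunktTrainer()
trainer.train(wikipedia_text, finalize=True)
tokenizer = PunktSentenceTokenizer(trainer.get_params())

masked, iocs = mask_iocs(report_text)
sentences = tokenizer.tokenize(masked)  # placeholders are restored later
```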
To minimize the risk of misreading the first word of a sentence as part of a URL, we collect each sentence-initial word and its frequency from the English version of Wikipedia and split off the trailing word if it appears frequently (more than 5000 times) at the beginning of a sentence. In addition, we improve tokenization not only by applying word spacing but also by separating characters such as parentheses, periods, and quotation marks. We modified the word tokenizer of Python's Natural Language Toolkit to suit the target corpus.
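The sentence-initial check might look like the following sketch, assuming `sent_initial_freq` holds sentence-initial word counts collected from Wikipedia; the fused token shown in the docstring is a hypothetical example of the failure case.

```python
SENT_INITIAL_THRESHOLD = 5000  # frequency cutoff used in the paper

def split_absorbed_word(token: str, sent_initial_freq: dict) -> list:
    """Split off a trailing word that a URL-like token absorbed across a
    sentence boundary, e.g. 'example.com.The' -> ['example.com', '.', 'The']."""
    head, dot, tail = token.rpartition(".")
    if dot and sent_initial_freq.get(tail, 0) > SENT_INITIAL_THRESHOLD:
        return [head, ".", tail]
    return [token]
```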
After tokenization is completed, we restore the original strings in place of the special tokens introduced for noise elimination and reunite inappropriately separated special characters, e.g., in URLs. Other treatments, such as word spacing, are also executed. Table 7 shows the optimized text extracted from the PDF documents.
Appendix 2: Unicode corresponding characters
See Table 8.
Appendix 3: Key 20 entities in cybersecurity
See Table 9.
Appendix 4: Hyper-parameter settings
See Table 10.
The hyper-parameters used in our study are shown above. We adopt the default values used by Lee et al. [25] and specify 200 additional hidden dimensions with a dropout rate of 0.9, because the BOC features are deployed as additional inputs for the pre- and post-Bi-LSTM-CRF models. In both the forward and backward directions of the LSTM, we set the hidden feature dimension to 300.
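For concreteness, a minimal PyTorch sketch of a Bi-LSTM-CRF with these settings is given below. This is our own illustration, not the authors' code: the word-embedding size and the concatenation of the BOC features with the word embeddings are assumptions, and the CRF layer comes from the third-party pytorch-crf package.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    """Sketch of a tagger with the hyper-parameters of Table 10: 300 hidden
    features per LSTM direction, dropout 0.9, and 200 extra hidden
    dimensions carrying the BOC features (combined here by concatenation)."""
    def __init__(self, vocab_size, num_tags, word_dim=300, boc_dim=200, hidden_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.dropout = nn.Dropout(0.9)
        self.lstm = nn.LSTM(word_dim + boc_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, word_ids, boc_feats, tags=None, mask=None):
        x = torch.cat([self.embed(word_ids), boc_feats], dim=-1)
        emissions = self.proj(self.dropout(self.lstm(x)[0]))
        if tags is not None:                          # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: best tag sequences
```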
Cite this article
Kim, G., Lee, C., Jo, J. et al. Automatic extraction of named entities of cyber threats using a deep Bi-LSTM-CRF network. Int. J. Mach. Learn. & Cyber. 11, 2341–2355 (2020). https://doi.org/10.1007/s13042-020-01122-6