Abstract
Countless cyber threat intelligence (CTI) reports are used by companies around the world on a daily basis for security purposes. To secure critical cybersecurity information, analysts and individuals must accordingly analyze the information on threats and vulnerabilities contained in these reports. However, analyzing such an overwhelming volume of reports requires considerable time and effort. In this study, we propose a novel approach that automatically extracts core information from CTI reports using a named entity recognition (NER) system. In constructing the proposed NER system, we defined meaningful keywords in the security domain as entities, including malware, domain/URL, IP address, hash, and Common Vulnerabilities and Exposures (CVE), and linked these keywords with the words extracted from the text of the reports. To achieve higher performance, we fed character-level feature vectors into a bidirectional long short-term memory network with a conditional random field layer (Bi-LSTM-CRF), achieving an average F1-score of 75.05%. We also release the dataset of 498,000 tags created during our research.
Notes
Details of these hyper-parameters are provided in "Appendix 4".
References
Alzaidy R, Caragea C, Giles CL (2019) Bi-LSTM-CRF sequence labeling for keyphrase extraction from scholarly documents. In: The World Wide Web conference. ACM, pp 2551–2557
Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 5(2):157–166
Bridges RA, Jones CL, Iannacone MD, Testa KM, Goodall JR (2013) Automatic labeling for entity extraction in cyber security. arXiv preprint arXiv:1308.4941
Pham TH, Le-Hong P (2018) End-to-end recurrent neural network models for Vietnamese named entity recognition: word-level vs. character-level. In: Computational linguistics: 15th international conference of the Pacific Association for Computational Linguistics, PACLING 2017, Yangon, Myanmar, 16–18 Aug 2017, Revised Selected Papers, vol 781. Springer, p 219
Chismon D, Ruks M (2015) Threat intelligence: collecting, analysing, evaluating. MWR InfoSecurity Ltd, London
Chiu JP, Nichols E (2016) Named entity recognition with bidirectional LSTM-CNNs. Trans Assoc Comput Linguist 4:357–370
Conti M, Dargahi T, Dehghantanha A (2018) Cyber threat intelligence: challenges and opportunities. Springer, Berlin
Corbett P, Boyle J (2018) Chemlistem: chemical named entity recognition using recurrent neural networks. J Cheminform 10(1):59
Elman JL (1990) Finding structure in time. Cogn Sci 14(2):179–211
Gasmi H, Bouras A, Laval J (2018) LSTM recurrent neural networks for cybersecurity named entity recognition. ICSEA 2018:11
Gasmi H, Laval J, Bouras A (2019) Information extraction of cybersecurity concepts: an LSTM approach. Appl Sci 9(19):3945
Goldberg Y (2016) A primer on neural network models for natural language processing. J Artif Intell Res 57:345–420
Gordon MS (2018) Economic and national security effects of cyber attacks against small business communities. PhD thesis, Utica College
Graves A (2012) Supervised sequence labelling. In: Supervised sequence labelling with recurrent neural networks. Springer, pp 5–13
Graves A, Mohamed AR, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE, pp 6645–6649
Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw 18(5–6):602–610
Habibi M, Weber L, Neves M, Wiegandt DL, Leser U (2017) Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 33(14):i37–i48
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Hu Z, Ma X, Liu Z, Hovy E, Xing E (2016) Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318
Huang Z, Xu W, Yu K (2015) Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991
Jayakumar A, Yang JL (2014) Target says up to 70 million more customers were hit by December data breach. Washington Post, 10 Jan 2014
Joshi A, Lal R, Finin T, Joshi A (2013) Extracting cybersecurity related linked data from text. In: 2013 IEEE seventh international conference on semantic computing. IEEE, pp 252–259
Kiss T, Strunk J (2006) Unsupervised multilingual sentence boundary detection. Comput Linguist 32(4):485–525
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436
Lee C, Kim YB, Lee D, Lim H (2018) Character-level feature extraction with densely connected networks. arXiv preprint arXiv:1806.09089
McCallum A, Freitag D, Pereira FC (2000) Maximum entropy Markov models for information extraction and segmentation. In: ICML, vol 17, pp 591–598
Mikolov T, Karafiát M, Burget L, Černockỳ J, Khudanpur S (2010) Recurrent neural network based language model. In: Eleventh annual conference of the international speech communication association
More S, Matthews M, Joshi A, Finin T (2012) A knowledge-based approach to intrusion detection modeling. In: 2012 IEEE symposium on security and privacy workshops. IEEE, pp 75–81
Mulwad V, Li W, Joshi A, Finin T, Viswanathan K (2011) Extracting information about security vulnerabilities from web text. In: Proceedings of the 2011 IEEE/WIC/ACM international conferences on web intelligence and intelligent agent technology, vol 03. IEEE Computer Society, pp 257–260
Nunes E, Diab A, Gunn A, Marin E, Mishra V, Paliath V, Robertson J, Shakarian J, Thart A, Shakarian P (2016) Darknet and deepnet mining for proactive cybersecurity threat intelligence. In: 2016 IEEE conference on intelligence and security informatics (ISI). IEEE, pp 7–12
Pennington J, Socher R, Manning C (2014) GloVe: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
Ratnaparkhi A (1996) A maximum entropy model for part-of-speech tagging. In: Conference on empirical methods in natural language processing
Reimers N, Gurevych I (2017) Reporting score distributions makes a difference: performance study of LSTM-networks for sequence tagging. arXiv preprint arXiv:1707.09861
Robertson J, Diab A, Marin E, Nunes E, Paliath V, Shakarian J, Shakarian P (2016) Darknet mining and game theory for enhanced cyber threat intelligence. Cyber Def Rev 1(2):95–122
Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117
Wu F, Liu J, Wu C, Huang Y, Xie X (2019) Neural Chinese named entity recognition via CNN-LSTM-CRF and joint training with word segmentation. In: The World Wide Web conference. ACM, pp 3342–3348
Yadav V, Bethard S (2019) A survey on recent advances in named entity recognition from deep learning models. arXiv preprint arXiv:1910.11470
Zhou K, Zhang S, Meng X, Luo Q, Wang Y, Ding K, Feng Y, Chen M, Cohen K, Xia J (2018) CRF-LSTM text mining method unveiling the pharmacological mechanism of off-target side effect of anti-multiple myeloma drugs. In: Proceedings of the BioNLP 2018 workshop, pp 166–171
Funding
Funding was provided by the Korea Creative Content Agency (Grant No. R2017030045).
Appendices
Appendix 1: Text extraction
1.1 Unnecessary contents
PDF documents contain unnecessary information such as page numbers, annotations, and watermarks, which makes it difficult to identify the body text and introduces incorrect sentence endings that act as noise during text extraction. We therefore eliminate such information by converting the unstructured format into an HTML document, which provides useful cues such as the font, font size, and line breaks (<br> tags), among other factors (Table 6). During this process, the best noise elimination is achieved when we remove less than 5% of the strings in the entire document; we also delete repetitive strings whose normalized frequency exceeds a threshold of 0.5.
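As an illustration, the repeated-string filter can be sketched as follows. This is a minimal sketch rather than the authors' implementation: it assumes the HTML conversion yields a list of text lines and that the page count is known, and it drops lines whose normalized repetition exceeds 0.5 while capping total removal at 5% of the document's characters.

```python
from collections import Counter

def strip_repeated_lines(lines, num_pages, freq_threshold=0.5, max_removal_ratio=0.05):
    """Drop lines that recur across pages (headers, footers, watermarks),
    keeping total removal under max_removal_ratio of all characters."""
    counts = Counter(line.strip() for line in lines)
    total_chars = sum(len(line) for line in lines)
    removed_chars, kept = 0, []
    for line in lines:
        norm_freq = counts[line.strip()] / num_pages  # normalized repetition
        over_budget = removed_chars + len(line) > max_removal_ratio * total_chars
        if norm_freq > freq_threshold and not over_budget:
            removed_chars += len(line)  # treat as boilerplate noise
        else:
            kept.append(line)
    return kept
```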
Discrepancies among character sets are another issue. For example, Unicode contains some 20 different characters for quotation marks (“ ”), which causes errors when applying rule-based and deep learning approaches. We therefore map characters with the same meaning to a single representation (Appendix 2): a post-processing step replaces semantically identical or visually similar characters with their keyboard equivalents. Although noise elimination using HTML offers useful separators, reconstructing complete sentences remains difficult. Therefore, we need a method for determining whether two lines are consecutive, which is hard to detect from the HTML code alone.
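For example, this normalization can be performed with a simple translation table. The mapping below is an illustrative subset only; the full set of corresponding characters is given in Appendix 2 (Table 8).

```python
# Illustrative subset of the Unicode-to-keyboard mapping (see Table 8).
UNICODE_TO_ASCII = {
    "\u2018": "'", "\u2019": "'",   # curly single quotation marks
    "\u201c": '"', "\u201d": '"',   # curly double quotation marks
    "\u2013": "-", "\u2014": "-",   # en dash, em dash
    "\u00a0": " ",                  # non-breaking space
}

def normalize_characters(text: str) -> str:
    """Replace semantically identical or visually similar characters
    with their keyboard equivalents."""
    return text.translate(str.maketrans(UNICODE_TO_ASCII))
```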
1.2 Text line conjunction
For example, a position where newline tags (<br>) appear may mark either the beginning of a new paragraph or a continuation of the sentence on the previous line.
Hyphens (-) also cause ambiguity. Some English words naturally contain hyphens; however, a long word at the end of a line may also be split into two parts with a hyphen when there is no space to fit it as a single word. In addition, a period (.) cannot always be treated as a sentence-ending indicator, because various acronyms and abbreviations also use it, e.g., a postscript (P.S.). In particular, the dataset used herein includes numerous periods (.) inside URLs and IP addresses, as well as subtitles that end without a period, which increases the ambiguity when analyzing HTML tags. We solve these problems by predicting the original word based on word-occurrence frequencies in a preprocessed corpus, using the English version of Wikipedia as the reference corpus. We consider the following candidate forms of a word:
1. The word spanning the previous and following lines, hyphenated or not.
2. The word with or without an internal space when calculating the word frequency.
3. The word form given by the most frequently occurring combination.
If all candidate combinations have zero frequency, the two consecutive lines are simply joined. However, if the character size in the HTML tags does not match that of the next line, no joining is performed, as the lines are treated as discontinuous. A minimal sketch of this join decision follows.
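The sketch below assumes unigram counts from the English Wikipedia are available as a dictionary; the candidate forms and the tie-breaking rule are our reading of the procedure, not the authors' code.

```python
def join_line_break(prev_last: str, next_first: str, freq: dict) -> str:
    """Decide how to merge a hyphenated line-final fragment with the first
    word of the next line, using reference-corpus unigram counts."""
    stem = prev_last.rstrip("-")
    candidates = [
        stem + next_first,          # soft line break: "net-" + "work" -> "network"
        stem + "-" + next_first,    # true hyphen: "real-" + "time" -> "real-time"
        stem + " " + next_first,    # two separate words
    ]
    # Score each candidate by the rarest word it contains; with all-zero
    # frequencies, max() keeps the first candidate, i.e., the joined form.
    return max(candidates, key=lambda c: min(freq.get(w, 0) for w in c.split()))
```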
Owing to the nature of intelligence reports, the text contains an excessive number of periods that break existing sentence boundary detection models. These limits led us to develop a customized sentence boundary recognizer for intelligence reports that combines an unsupervised learning model with regular expressions. Specifically, a rule-based process first replaces the irrelevant periods in URLs and IP addresses with special tokens, and a period-based sentence boundary detection model is then applied to identify the correct sentence boundaries. However, some irrelevant periods cannot be removed by this process, and the reports alone provide insufficient data for training; we therefore train the period-based sentence boundary detection model with the unsupervised method of Kiss and Strunk [23], using the English version of Wikipedia as the reference corpus.
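A rough sketch of this pipeline using NLTK's implementation of the Kiss and Strunk model (Punkt) is shown below. The indicator regex and placeholder scheme are simplified stand-ins for the paper's rules, and `wikipedia_text` and `report_text` are assumed to be loaded elsewhere.

```python
import re
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Simplified pattern for URLs and IPv4 addresses whose internal periods
# must not be mistaken for sentence boundaries.
IOC_RE = re.compile(r"https?://\S+|\b(?:\d{1,3}\.){3}\d{1,3}\b")

def mask_iocs(text):
    """Replace URLs and IP addresses with placeholder tokens."""
    iocs = IOC_RE.findall(text)
    for i, ioc in enumerate(iocs):
        text = text.replace(ioc, f"IOCTOKEN{i}", 1)
    return text, iocs

# Train the unsupervised Punkt model [23] on the reference corpus.
trainer = PunktTrainer()
trainer.train(wikipedia_text, finalize=True)
tokenizer = PunktSentenceTokenizer(trainer.get_params())

masked, iocs = mask_iocs(report_text)
sentences = tokenizer.tokenize(masked)  # placeholders are restored later
```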
To minimize the risk of misreading the first word of a sentence as part of a URL, we collect each sentence-initial word and its frequency from the English version of Wikipedia and split off the trailing word if it appears frequently (more than 5000 times) at the beginning of a sentence. In addition, we improve tokenization not only by applying word spacing but also by separating characters such as parentheses, periods, and quotation marks. We modified the word tokenizer of Python's Natural Language Toolkit to suit the target corpus.
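The sentence-initial check might look like the following sketch, assuming `sent_initial_freq` holds sentence-initial word counts collected from Wikipedia; the fused token shown in the docstring is a hypothetical example of the failure case.

```python
SENT_INITIAL_THRESHOLD = 5000  # frequency cutoff used in the paper

def split_absorbed_word(token: str, sent_initial_freq: dict) -> list:
    """Split off a trailing word that a URL-like token absorbed across a
    sentence boundary, e.g. 'example.com.The' -> ['example.com', '.', 'The']."""
    head, dot, tail = token.rpartition(".")
    if dot and sent_initial_freq.get(tail, 0) > SENT_INITIAL_THRESHOLD:
        return [head, ".", tail]
    return [token]
```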
After tokenization is completed, we restore the original strings in place of the special tokens introduced for noise elimination and reunite inappropriately separated special characters, e.g., in URLs. Other treatments, such as word spacing, are also executed. Table 7 shows the optimized text extracted from the PDF documents.
Appendix 2: Unicode corresponding characters
See Table 8.
Appendix 3: Key 20 entities in cybersecurity
See Table 9.
Appendix 4: Hyper-parameter settings
See Table 10.
The hyper-parameters used in our study are shown above. We adopt the default values used by Lee et al. [25] and specify 200 additional hidden dimensions with a dropout rate of 0.9, because the BOC features are deployed as additional inputs for the pre- and post-Bi-LSTM-CRF models. In both the forward and backward directions of the LSTM, we set the hidden feature dimension to 300.
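For concreteness, a minimal PyTorch sketch of a Bi-LSTM-CRF with these settings is given below. This is our own illustration, not the authors' code: the word-embedding size and the concatenation of the BOC features with the word embeddings are assumptions, and the CRF layer comes from the third-party pytorch-crf package.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    """Sketch of a tagger with the hyper-parameters of Table 10: 300 hidden
    features per LSTM direction, dropout 0.9, and 200 extra hidden
    dimensions carrying the BOC features (combined here by concatenation)."""
    def __init__(self, vocab_size, num_tags, word_dim=300, boc_dim=200, hidden_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.dropout = nn.Dropout(0.9)
        self.lstm = nn.LSTM(word_dim + boc_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, word_ids, boc_feats, tags=None, mask=None):
        x = torch.cat([self.embed(word_ids), boc_feats], dim=-1)
        emissions = self.proj(self.dropout(self.lstm(x)[0]))
        if tags is not None:                          # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: best tag sequences
```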
Cite this article
Kim, G., Lee, C., Jo, J. et al. Automatic extraction of named entities of cyber threats using a deep Bi-LSTM-CRF network. Int. J. Mach. Learn. & Cyber. 11, 2341–2355 (2020). https://doi.org/10.1007/s13042-020-01122-6