Feature Importance for Biomedical Named Entity Recognition

Huggard, Hamish; Zhang, Aaron; Zhang, Edmond; Koh, Yun Sing

doi:10.1007/978-3-030-35288-2_33

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11919)

Included in the following conference series:

Australasian Joint Conference on Artificial Intelligence

Abstract

Within biomedical natural language processing (bioNLP), researchers have used many token features in machine learning models. With recent progress in word-embedding algorithms, it is no longer clear whether most of these features are still useful. In this paper we survey the features that have been used in bioNLP and evaluate each feature’s utility on a sample bioNLP task: the N2C2 2018 named entity recognition challenge. The features we test include two types of word embeddings; syntactic, lexical, and orthographic features; character embeddings; and clustering and distributional word representations. We find that using fastText word embeddings results in a significantly higher \(F_1\) score than using any other individual feature (0.9142, compared to 0.8750 for the next-best feature). Furthermore, we conducted several experiments using combinations of features and found that every tested combination attained a lower \(F_1\) score than word embeddings alone. This indicates that supplementing word embeddings with additional features is not beneficial and may even be detrimental.
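For readers unfamiliar with the metric, the following is a minimal sketch of how an \(F_1\) score can be computed for named entity recognition. It assumes exact-match scoring, where a predicted entity counts only if its span and label both agree with the gold annotation; the abstract does not state which matching criterion the authors used, and the function name, offsets, and labels below are illustrative only.

```python
from typing import Set, Tuple

# An entity is (start offset, end offset, label). These values are
# made up for illustration and are not taken from the paper's data.
Entity = Tuple[int, int, str]


def f1_score(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Exact-match F1: the harmonic mean of precision and recall."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {(0, 7, "Drug"), (12, 17, "Strength"), (30, 41, "ADE")}
# The second prediction has the right span but the wrong label,
# so under exact matching it is counted as an error.
predicted = {(0, 7, "Drug"), (12, 17, "Dosage"), (30, 41, "ADE")}

score = f1_score(gold, predicted)
print(f"F1 = {score:.4f}")
```

Here two of three predictions match, so precision and recall are both 2/3 and the harmonic mean is also 2/3. A more lenient criterion, such as crediting overlapping spans, would give a different score for the same predictions.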

Acknowledgements

This research was supported by the Precision Driven Health Partnership (www.precisiondrivenhealth.com).

Author information

Authors and Affiliations

University of Auckland, Auckland, 1010, New Zealand
Hamish Huggard, Aaron Zhang & Yun Sing Koh
Orion Health, 181 Grafton Road, Grafton, Auckland, 1010, New Zealand
Edmond Zhang

Corresponding author

Correspondence to Hamish Huggard.

Editor information

Editors and Affiliations

University of South Australia, Adelaide, SA, Australia
Jixue Liu
The University of Melbourne, Melbourne, VIC, Australia
James Bailey

Cite this paper

Huggard, H., Zhang, A., Zhang, E., Koh, Y.S. (2019). Feature Importance for Biomedical Named Entity Recognition. In: Liu, J., Bailey, J. (eds) AI 2019: Advances in Artificial Intelligence. AI 2019. Lecture Notes in Computer Science, vol 11919. Springer, Cham. https://doi.org/10.1007/978-3-030-35288-2_33

DOI: https://doi.org/10.1007/978-3-030-35288-2_33
Published: 25 November 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-35287-5
Online ISBN: 978-3-030-35288-2
eBook Packages: Computer Science, Computer Science (R0)
