Feature Importance for Biomedical Named Entity Recognition

  • Conference paper
AI 2019: Advances in Artificial Intelligence (AI 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11919)

Abstract

Within the domain of biomedical natural language processing (bioNLP), researchers have used many token features for machine learning models. With recent progress in word embedding algorithms, it is no longer clear whether most of these features are still useful. In this paper we survey the features that have been used in bioNLP and evaluate each feature’s utility in a sample bioNLP task: the N2C2 2018 named entity recognition challenge. The features we test include two types of word embeddings; syntactic, lexical, and orthographic features; character embeddings; and clustering and distributional word representations. We find that using fastText word embeddings results in a significantly higher \(F_1\) score than using any other individual feature (0.9142 compared to 0.8750 for the next-best feature). Furthermore, we conducted several experiments using combinations of features and found that all tested combinations attained a lower \(F_1\) score than using word embeddings only. This indicates that supplementing word embeddings with additional features is not beneficial, and may even be detrimental.
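
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how an \(F_1\) score can be computed for named entity recognition. It assumes exact-match scoring over (start, end, label) spans; the abstract does not say which matching criterion the paper uses, and the function name, span encoding, and example labels are hypothetical illustrations, not taken from the paper's code.

```python
from typing import Set, Tuple

# Hypothetical span encoding: (start offset, end offset, entity label).
Span = Tuple[int, int, str]


def f1_score(gold: Set[Span], predicted: Set[Span]) -> float:
    """Harmonic mean of precision and recall under exact span matching.

    Assumption: a prediction counts as correct only if its offsets and
    label both match a gold span exactly.
    """
    true_positives = len(gold & predicted)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case without dividing by zero.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only: three gold entities, one predicted with the wrong label.
gold = {(0, 7, "Drug"), (12, 17, "Strength"), (30, 41, "ADE")}
predicted = {(0, 7, "Drug"), (12, 17, "Dosage"), (30, 41, "ADE")}

print(f1_score(gold, predicted))  # precision and recall are both 2/3
```

Because \(F_1\) is a harmonic mean, it rewards a model only when precision and recall are both high, which is why it is a common single-number summary for comparing feature sets on a task like this one.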

Notes

  1. Our code is available at https://github.com/lajesticvantrashell/N2C2_2018_track_2.

  2. https://portal.dbmi.hms.harvard.edu/projects/n2c2-t2/.

  3. https://fasttext.cc/docs/en/pretrained-vectors.html.

  4. https://fasttext.cc/docs/en/english-vectors.html.

  5. https://tfhub.dev/google/elmo/1.

  6. https://www.fda.gov/drugs/drug-approvals-and-databases/drugsfda-data-files.

  7. http://www.nactem.ac.uk/GENIA/tagger/.

Acknowledgements

This research was supported by the Precision Driven Health Partnership (www.precisiondrivenhealth.com).

Author information

Corresponding author

Correspondence to Hamish Huggard.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Huggard, H., Zhang, A., Zhang, E., Koh, Y.S. (2019). Feature Importance for Biomedical Named Entity Recognition. In: Liu, J., Bailey, J. (eds) AI 2019: Advances in Artificial Intelligence. AI 2019. Lecture Notes in Computer Science, vol 11919. Springer, Cham. https://doi.org/10.1007/978-3-030-35288-2_33

  • DOI: https://doi.org/10.1007/978-3-030-35288-2_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-35287-5

  • Online ISBN: 978-3-030-35288-2

  • eBook Packages: Computer Science, Computer Science (R0)
