Parsing Biomedical Literature

  • Conference paper
Natural Language Processing – IJCNLP 2005 (IJCNLP 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3651)

Abstract

We present a preliminary study of several parser adaptation techniques evaluated on the GENIA corpus of MEDLINE abstracts [1,2]. We begin by observing that the Penn Treebank (PTB) is lexically impoverished when measured on various genres of scientific and technical writing, and that this significantly impacts parse accuracy. To resolve this without requiring in-domain treebank data, we show how existing domain-specific lexical resources may be leveraged to augment PTB-training: part-of-speech tags, dictionary collocations, and named-entities. Using a state-of-the-art statistical parser [3] as our baseline, our lexically-adapted parser achieves a 14.2% reduction in error. With oracle-knowledge of named-entities, this error reduction improves to 21.2%.
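The error-reduction figures quoted above are relative measures: they describe what fraction of the baseline parser's error the adapted parser removes. The abstract does not say which accuracy metric underlies them, so the sketch below assumes a single score in [0, 1] (for example a bracketing F-measure) and treats error as one minus that score. The function name and the example scores are hypothetical and chosen only to show the arithmetic; they are not results from the paper.

```python
def relative_error_reduction(baseline_score: float, adapted_score: float) -> float:
    """Return the fraction of the baseline's error removed by the adapted system.

    Both scores are assumed to lie in [0, 1]; error is defined as 1 - score.
    """
    if not (0.0 <= baseline_score < 1.0):
        raise ValueError("baseline_score must be in [0, 1) so that its error is non-zero")
    if not (0.0 <= adapted_score <= 1.0):
        raise ValueError("adapted_score must be in [0, 1]")
    baseline_error = 1.0 - baseline_score
    adapted_error = 1.0 - adapted_score
    return (baseline_error - adapted_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical scores, not taken from the paper.
    reduction = relative_error_reduction(baseline_score=0.80, adapted_score=0.85)
    print(f"Relative error reduction: {reduction:.1%}")
```

Under this definition, a gain of a few absolute points on an already accurate parser can correspond to a much larger relative reduction in error, which is why the two kinds of figures should not be compared directly.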

We thank the National Science Foundation for its support of this work (IIS-0112432, LIS-9721276, and DMS-0074276), and Sharon Goldwater and our anonymous reviewers for their valuable feedback.

References

  1. Kim, J.-D., Ohta, T., Tateisi, Y., Tsujii, J.: GENIA corpus - a semantically annotated corpus for bio-textmining. Bioinformatics (Supplement: Eleventh International Conference on Intelligent Systems for Molecular Biology) 19, i180–i182 (2003)

  2. Tateisi, Y., Ohta, T., Kim, J.-D., Hong, H., Jian, S., Tsujii, J.: The GENIA corpus: MEDLINE abstracts annotated with linguistic information. In: Third meeting of SIG on Text Mining, Intelligent Systems for Molecular Biology, ISMB (2003)

  3. Charniak, E.: A maximum-entropy-inspired parser. In: Proc. NAACL, pp. 132–139 (2000)

  4. Marcus, M., Santorini, B., Marcinkiewicz, M.A.: Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19, 313–330 (1993)

  5. Collins, M.: Discriminative reranking for natural language parsing. In: Proc. ICML, pp. 175–182 (2000)

  6. Ratnaparkhi, A.: Learning to parse natural language with maximum entropy models. Machine Learning 34, 151–175 (1999)

  7. Gildea, D.: Corpus variation and parser performance. In: Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, pp. 167–202 (2001)

  8. Roark, B., Bacchiani, M.: Supervised and unsupervised PCFG adaptation to novel domains. In: Proceedings of HLT-NAACL, pp. 205–212 (2003)

  9. Steedman, M., Hwa, R., Clark, S., Osborne, M., Sarkar, A., Hockenmaier, J., Ruhlen, P., Baker, S., Crim, J.: Example selection for bootstrapping statistical parsers. In: Proceedings of HLT-NAACL, pp. 331–338 (2003)

  10. de Bruijn, B., Martin, J.: Literature mining in molecular biology. In: Proceedings of the European Federation for Medical Informatics (EFMI) Workshop on Natural Language Processing in Biomedical Applications (2002)

  11. Hirschman, L., Park, J., Tsujii, J., Wong, L., Wu, C.: Accomplishments and challenges in literature data mining for biology. Bioinformatics 18, 1553–1561 (2002)

  12. Yakushiji, A., Tateisi, Y., Miyao, Y., Tsujii, J.: Event extraction from biomedical papers using a full parser. In: Pacific Symposium on Biocomputing, pp. 408–419 (2001)

  13. Daraselia, N., Yuryev, A., Egorov, S., Novichkova, S., Nikitin, A., Mazo, I.: Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics 20, 604–611 (2004)

  14. Shatkay, H., Feldman, R.: Mining the biomedical literature in the genomic era: An overview. Journal of Computational Biology 10, 821–855 (2003)

  15. Hwa, R.: Learning Probabilistic Lexicalized Grammars for Natural Language Processing. PhD thesis, Harvard University (2001)

  16. Bies, A., Ferguson, M., Katz, K., MacIntyre, R.: Bracketing Guidelines for Treebank II Style Penn Treebank Project. Linguistic Data Consortium (1995)

  17. Buckley, C.: Implementation of the SMART information retrieval system. Technical Report 85-686, Cornell University (1985)

  18. Goodman, J.: Parsing inside-out. PhD thesis, Harvard University (1998)

  19. McCray, A.T., Srinivasan, S., Browne, A.C.: Lexical methods for managing variation in biomedical terminologies. In: Proceedings of the 18th Annual Symposium on Computer Applications in Medical Care (SCAMC), pp. 235–239 (1994)

  20. Grover, C., Lapata, M., Lascarides, A.: A comparison of parsing technologies for the biomedical domain. Journal of Natural Language Engineering (2002)

  21. Surdeanu, M., Harabagiu, S., Williams, J., Aarseth, P.: Using predicate-argument structures for information extraction. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL-2003), pp. 8–15 (2003)

  22. Miyao, Y., Ninomiya, T., Tsujii, J.: Corpus-oriented grammar development for acquiring a head-driven phrase structure grammar from the Penn Treebank. In: Proc. of IJCNLP-2004, pp. 684–693 (2004)

  23. Zhou, G., Su, J.: Exploring deep knowledge resources in biomedical name recognition. In: Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications, JNLPBA-2004 (2004)

  24. Charniak, E.: Statistical parsing with a context-free grammar and word statistics. In: Proceedings of the Fourteenth National Conference on Artificial Intelligence. AAAI Press/MIT Press, Menlo Park (1997)

  25. Park, J.C.: Using combinatory categorial grammar to extract biomedical information. IEEE Intelligent Systems 16, 62–67 (2001)

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lease, M., Charniak, E. (2005). Parsing Biomedical Literature. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_6

  • DOI: https://doi.org/10.1007/11562214_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29172-5

  • Online ISBN: 978-3-540-31724-1

  • eBook Packages: Computer Science, Computer Science (R0)
