Skip to main content

Incorporating Dictionary Features into Conditional Random Fields for Gene/Protein Named Entity Recognition

  • Conference paper
Emerging Technologies in Knowledge Discovery and Data Mining (PAKDD 2007)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 4819))

Included in the following conference series:

Abstract

Biomedical Named Entity Recognition (BioNER) is an important preliminary step for biomedical text mining. Previous researchers built dictionaries of gene/protein names from online databases and incorporated them into machine learning models as features, but the effects were very limited. This paper gives a quality assessment of four dictionaries derived form online resources, and investigate the impacts of two factors (i.e., dictionary coverage and noisy terms) that may lead to the poor performance of dictionary features. Experiments are performed by comparing performances of the external dictionaries and a dictionary derived from GENETAG corpus, using Conditional Random Fields (CRFs) with dictionary features. We also make observations of the impacts regarding long names and short names. The results show that low coverage of long names and noises of short names are the main problems of current online resources and a high quality dictionary could substantially improve the accuracy of BioNER.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Cohen, A.M, Hersh, W.R.: A survey of current work in biomedical text mining. Briefings in Bioinformatics 6(1), 57–71 (2005)

    Article  Google Scholar 

  2. Bikel, D., Schwartz, R., Weischedel, R.: An algorithm that learns what’s in a name. Machine Learning 34, 211–231 (1997)

    Article  Google Scholar 

  3. Tjong, E.F., Sang, K., De Meulder, F.: Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning (CoNLL-2003), pp. 142–147 (2003)

    Google Scholar 

  4. Kim, J.D, Tomoko, O., Yoshimasa, T., et al.: Introduction to the Bio-Entity Recognition Task at JNLPBA. In: Proceedings of the International Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-04), pp. 70–75 (2004)

    Google Scholar 

  5. Zhou, G., Su, J.: Exploring Deep Knowledge Resources in Biomedical Name Recognition. In: Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004), pp. 96–99 (2004)

    Google Scholar 

  6. Hirschman, L., Yeh, A., Blaschke, C., Valencia, A.: Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics 6(1), 1 (2005)

    Article  Google Scholar 

  7. Finkel, J., Dingare, S., Manning, C.D.: Exploring the boundaries: gene and protein identification in biomedical text. BMC Bioinformatics 6(1), S5 (2005)

    Article  Google Scholar 

  8. McCallum, A., Freitag, D., Pereira, F.: Maximum entropy Markov models for information extraction and segmentation. In: Proceedings of The Seventeenth International Conference on Machine Learning, pp. 591–598. Morgan Kaufmann, San Francisco (2000)

    Google Scholar 

  9. Liu, H., Hu, Z., Torii, M., Wu, C., Friedman, C.: Quantitative Assessment of Dictionary-based Protein Named Entity Tagging. Journal of the American Medical Informatics Association 13(5), 497–507 (2006)

    Article  MATH  Google Scholar 

  10. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the International Conference on Machine Learning, pp. 282–289. Morgan Kaufmann, San Francisco, CA (2001)

    Google Scholar 

  11. Settles, B.: Biomedical Named Entity Recognition Using Conditional Random Fields and Novel Feature Sets. In: Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004), pp. 104–107 (2004)

    Google Scholar 

  12. Tanabe, L., Xie, N., Thom, L.H., Matten, W., Wilbur, W.J.: GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics 6(1) (2005)

    Google Scholar 

  13. Cohen, A.M.: Unsupervised gene/protein entity normalization using automatically extracted dictionaries. In: Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, Proceedings of the BioLINK2005 Workshop; Detroit, MI: Association for Computational Linguistics, pp. 17–24 (2005)

    Google Scholar 

  14. Tanabe, L., Wilbur, W.J.: Generation of a Large Gene/Protein Lexicon by Morphological Pattern Analysis. Journal of Bioinformatics and Computational Biology 1(4), 611–626 (2004)

    Article  Google Scholar 

  15. Tanabe, L., Wilbur, W.J.: Tagging gene and protein names in biomedical text. Bioinformatics 18(8), 1124–1132 (2002)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Takashi Washio Zhi-Hua Zhou Joshua Zhexue Huang Xiaohua Hu Jinyan Li Chao Xie Jieyue He Deqing Zou Kuan-Ching Li Mário M. Freire

Rights and permissions

Reprints and permissions

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lin, H., Li, Y., Yang, Z. (2007). Incorporating Dictionary Features into Conditional Random Fields for Gene/Protein Named Entity Recognition. In: Washio, T., et al. Emerging Technologies in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science(), vol 4819. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77018-3_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-540-77018-3_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-77016-9

  • Online ISBN: 978-3-540-77018-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics