Abstract
The challenge of knowledge management in the pharmaceutical industry is twofold. First, it must integrate sequence data with the vast and growing body of data from functional analysis of genes and with the information held in huge historical archival databases. Second, as the number of biomedical publications increases exponentially (Medline now contains more than 13 million records), researchers need assistance to broaden their vision and comprehension of scientific domains. Just as data mining uncovers relationships in information, text mining uncovers relationships in a text collection and leverages the creativity of the knowledge worker in exploring these relationships and discovering new knowledge. We describe herein a text mining method that automatically detects protein interactions described across a large number of scientific publications. This method relies on natural language processing to identify protein names, their synonyms, and the various interactions they can have with other proteins. We then compared text mining of abstracts with the same analysis of full text articles to assess how much information is lost when only abstracts are processed. Our results show that: 1) LexiQuest Mine is a very versatile and accurate tool for mining biomedical literature to analyze interactions between proteins. 2) Mining only abstracts can be sufficient and time saving for large-scale applications that do not require a high level of detail, whereas mining full text articles is preferable for more exhaustive applications designed to address a specific issue. Availability: LexiQuest Mine is available for commercial licensing from SPSS, Inc.
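To make the abstract-versus-full-text comparison concrete, here is a minimal Python sketch. It is not the LexiQuest Mine pipeline, whose internals are not described here. It assumes a tiny hypothetical synonym dictionary (SYNONYMS), a hypothetical list of interaction verb stems, and a naive sentence splitter, and it treats two proteins as interacting when they appear in the same sentence as an interaction verb. The function abstract_coverage then reports the fraction of interactions found in the full text that are also found in the abstract.

```python
import re
from itertools import combinations

# Hypothetical synonym dictionary (assumption, for illustration only):
# lowercase surface form -> canonical protein name.
SYNONYMS = {
    "p53": "TP53",
    "tp53": "TP53",
    "mdm2": "MDM2",
    "hdm2": "MDM2",
    "p21": "CDKN1A",
    "cdkn1a": "CDKN1A",
}

# Hypothetical interaction verb stems (assumption); \w* absorbs inflections.
INTERACTION_PATTERN = re.compile(
    r"\b(bind|interact|phosphorylat|inhibit|activat)\w*", re.IGNORECASE
)
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def extract_interactions(text):
    """Return unordered pairs of canonical protein names that share a
    sentence with at least one interaction verb."""
    pairs = set()
    for sentence in split_sentences(text):
        if not INTERACTION_PATTERN.search(sentence):
            continue
        # Map every token that is a known synonym to its canonical name.
        proteins = {
            SYNONYMS[token.lower()]
            for token in TOKEN_PATTERN.findall(sentence)
            if token.lower() in SYNONYMS
        }
        # Sorting makes each pair order-independent.
        for first, second in combinations(sorted(proteins), 2):
            pairs.add((first, second))
    return pairs


def abstract_coverage(abstract, full_text):
    """Fraction of full-text interactions also found in the abstract.
    Returns None when the full text yields no interactions."""
    found_in_full = extract_interactions(full_text)
    if not found_in_full:
        return None
    found_in_abstract = extract_interactions(abstract)
    return len(found_in_abstract & found_in_full) / len(found_in_full)


# Toy example texts (invented for illustration, not taken from the study).
example_abstract = "MDM2 binds p53."
example_full_text = (
    "MDM2 binds p53. In addition, p53 activates p21 transcription. "
    "Hdm2 levels were measured."
)
coverage = abstract_coverage(example_abstract, example_full_text)
```

On these toy texts the abstract yields one interaction and the full text two, so coverage is 0.5. A real system would need proper named entity recognition and syntactic analysis rather than sentence-level co-occurrence, which over-reports interactions in long sentences.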
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Martin, E.P.G., Bremer, E.G., Guerin, MC., DeSesa, C., Jouve, O. (2004). Analysis of Protein/Protein Interactions Through Biomedical Literature: Text Mining of Abstracts vs. Text Mining of Full Text Articles. In: López, J.A., Benfenati, E., Dubitzky, W. (eds) Knowledge Exploration in Life Science Informatics. KELSI 2004. Lecture Notes in Computer Science, vol 3303. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30478-4_9
DOI: https://doi.org/10.1007/978-3-540-30478-4_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23927-7
Online ISBN: 978-3-540-30478-4
eBook Packages: Springer Book Archive