Abstract
In the context of Part-of-Speech (PoS) tagging of specialized corpora, we propose an inductive approach that focuses on the most ‘important’ PoS tags, because mistagging them can lead to a complete misunderstanding of the text. After standard tagging of a biological corpus with Brill’s tagger, we observed persistent errors that are very hard to correct. As an application, we studied two cases of different natures: first, confusion between past participle, adjective, and preterit for verbs ending in ‘ed’; second, confusion between plural nouns and third-person singular present verbs. An expert corrected the examples through a user-friendly interface. From these well-annotated examples, we then induced rules using a propositional rule induction algorithm. Experimental validation showed an improvement in tagging precision. The relevance of the terminology of the field under consideration, here molecular biology, improves greatly when the number of these tagging errors decreases.
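The correction step described above (expert-corrected examples turned into context rules) can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not name the rule learner, so a very simple single-condition learner stands in for it here. The feature names (`prev_tag`, `next_tag`), the thresholds, the Penn Treebank style tags, and the toy sentences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def features(tagged, i):
    """Propositional description of token i: the tags of its two neighbours."""
    prev_tag = tagged[i - 1][1] if i > 0 else "<S>"
    next_tag = tagged[i + 1][1] if i + 1 < len(tagged) else "</S>"
    return {"prev_tag": prev_tag, "next_tag": next_tag}

def induce_rules(examples, min_support=2, min_precision=1.0):
    """Keep every (feature, value) -> tag rule that is frequent and precise enough.

    Each example is (tagged_sentence, index_of_ambiguous_word, expert_tag).
    """
    counts = defaultdict(Counter)
    for tagged, i, gold in examples:
        for name, value in features(tagged, i).items():
            counts[(name, value)][gold] += 1
    rules = {}
    for cond, tag_counts in counts.items():
        tag, hits = tag_counts.most_common(1)[0]
        total = sum(tag_counts.values())
        if total >= min_support and hits / total >= min_precision:
            rules[cond] = tag
    return rules

def retag(tagged, i, rules):
    """Apply the first matching rule; otherwise keep the tagger's original tag."""
    for cond in features(tagged, i).items():
        if cond in rules:
            return rules[cond]
    return tagged[i][1]

# Hypothetical expert-corrected examples for words ending in 'ed'.
# The tag inside each sentence is the initial tagger output; the last
# element of each tuple is the tag chosen by the expert.
examples = [
    ([("was", "VBD"), ("phosphorylated", "VBD"), ("by", "IN")], 1, "VBN"),
    ([("is", "VBZ"), ("activated", "VBD"), ("by", "IN")], 1, "VBN"),
    ([("protein", "NN"), ("activated", "VBN"), ("transcription", "NN")], 1, "VBD"),
    ([("kinase", "NN"), ("inhibited", "VBN"), ("growth", "NN")], 1, "VBD"),
]

rules = induce_rules(examples)
new_sentence = [("were", "VBD"), ("expressed", "VBD"), ("in", "IN")]
corrected = retag(new_sentence, 1, rules)
```

On this toy data the learner keeps only conditions seen at least twice with a single expert tag, so a rule such as "next tag is IN, therefore VBN" survives while one-off contexts are discarded. A real propositional learner would combine several conditions per rule and prune them, which this sketch does not attempt.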
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Amrani, A., Roche, M., Kodratoff, Y., Matte-Tailliez, O. (2005). Inductive Improvement of Part-of-Speech Tagging and Its Effect on a Terminology of Molecular Biology. In: Kégl, B., Lapalme, G. (eds) Advances in Artificial Intelligence. Canadian AI 2005. Lecture Notes in Computer Science, vol 3501. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424918_38
DOI: https://doi.org/10.1007/11424918_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25864-3
Online ISBN: 978-3-540-31952-8
eBook Packages: Computer Science, Computer Science (R0)