Using Wiktionary to Improve Lexical Disambiguation in Multiple Languages

Nguyen, Kiem-Hieu; Ock, Cheol-Young

doi:10.1007/978-3-642-28604-9_20

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7181)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

This paper proposes using linguistic knowledge from Wiktionary to improve lexical disambiguation in multiple languages, focusing on part-of-speech tagging in selected languages with various characteristics including English, Vietnamese, and Korean. Dictionaries and subsumption networks are first automatically extracted from Wiktionary. These linguistic resources are then used to enrich the feature set of training examples. A first-order discriminative model is learned on training data using Hidden Markov Support Vector Machines. The proposed method is competitive with related contemporary works in the three languages. In English, our tagger achieves 96.37% token accuracy on the Brown corpus, with an error reduction of 2.74% over the baseline.
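
The feature-enrichment step mentioned in the abstract can be pictured with a minimal sketch. The abstract does not specify the feature templates, so everything below is an assumption made for illustration: the toy dictionary entries, the tag names, the baseline features, and the function names `base_features`, `enrich_with_dictionary`, and `featurize` are all hypothetical and are not taken from the paper. The sketch only shows the general idea of adding dictionary-derived indicators to each token's feature set before a sequence model is trained.

```python
from typing import Dict, List, Set

# Hypothetical toy dictionary: word form -> POS tags listed for that form.
# The paper extracts such a resource from Wiktionary; these entries are
# invented for illustration only.
TOY_DICTIONARY: Dict[str, Set[str]] = {
    "the": {"DET"},
    "can": {"NOUN", "VERB"},
    "rust": {"NOUN", "VERB"},
}


def base_features(tokens: List[str], i: int) -> Dict[str, float]:
    """Surface features for token i (an assumed baseline feature set)."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<s>"
    return {
        f"word={word.lower()}": 1.0,
        f"suffix3={word[-3:].lower()}": 1.0,
        f"is_capitalized={word[:1].isupper()}": 1.0,
        f"prev_word={prev_word}": 1.0,
    }


def enrich_with_dictionary(
    feats: Dict[str, float], word: str, dictionary: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Return a copy of feats with one indicator per dictionary tag of word."""
    enriched = dict(feats)
    tags = dictionary.get(word.lower())
    if tags is None:
        # The word form is not covered by the dictionary.
        enriched["dict_unknown"] = 1.0
    else:
        for tag in sorted(tags):
            enriched[f"dict_tag={tag}"] = 1.0
    return enriched


def featurize(
    tokens: List[str], dictionary: Dict[str, Set[str]]
) -> List[Dict[str, float]]:
    """Build enriched feature dictionaries for every token in a sentence."""
    return [
        enrich_with_dictionary(base_features(tokens, i), tokens[i], dictionary)
        for i in range(len(tokens))
    ]


if __name__ == "__main__":
    for token, feats in zip(
        ["The", "can", "leaks"], featurize(["The", "can", "leaks"], TOY_DICTIONARY)
    ):
        print(token, sorted(feats))
```

In this sketch an ambiguous form such as "can" receives both `dict_tag=NOUN` and `dict_tag=VERB`, so the dictionary narrows the candidate tags while the learned sequence model still has to choose between them from context.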

Author information

Authors and Affiliations

School of Electrical Engineering, University of Ulsan, 93, Daehakro, Nam-gu, Ulsan, 680-749, Korea
Kiem-Hieu Nguyen & Cheol-Young Ock

Editor information

Editors and Affiliations

Center for Computing Research (CIC), National Polytechnic Institute (IPN), Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Nguyen, KH., Ock, CY. (2012). Using Wiktionary to Improve Lexical Disambiguation in Multiple Languages. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28604-9_20

DOI: https://doi.org/10.1007/978-3-642-28604-9_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28603-2
Online ISBN: 978-3-642-28604-9
eBook Packages: Computer Science (R0)
