Using Wiktionary to Improve Lexical Disambiguation in Multiple Languages

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7181)

Abstract

This paper proposes using linguistic knowledge from Wiktionary to improve lexical disambiguation in multiple languages, focusing on part-of-speech tagging in selected languages with various characteristics including English, Vietnamese, and Korean. Dictionaries and subsumption networks are first automatically extracted from Wiktionary. These linguistic resources are then used to enrich the feature set of training examples. A first-order discriminative model is learned on training data using Hidden Markov-Support Vector Machines. The proposed method is competitive with related contemporary works in the three languages. In English, our tagger achieves 96.37% token accuracy on the Brown corpus, with an error reduction of 2.74% over the baseline.
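
The feature-enrichment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, the feature names, and the function `token_features` are hypothetical, and the lexicon stands in for a dictionary that would be extracted from Wiktionary. The idea shown is only that each token's usual features are extended with the part-of-speech tags a dictionary lists for that word, before a sequence model is trained on them.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon standing in for a Wiktionary-derived dictionary:
# each word maps to the set of POS tags the dictionary lists for it.
TOY_LEXICON: Dict[str, Set[str]] = {
    "the": {"DET"},
    "can": {"NOUN", "VERB", "AUX"},
    "fish": {"NOUN", "VERB"},
}


def token_features(tokens: List[str], i: int,
                   lexicon: Dict[str, Set[str]]) -> List[str]:
    """Return baseline features for tokens[i], enriched with dictionary tags."""
    word = tokens[i].lower()
    prev_word = tokens[i - 1].lower() if i > 0 else "<s>"

    # Baseline features (illustrative only): word form, suffix, previous word.
    feats = [f"word={word}", f"suffix2={word[-2:]}", f"prev_word={prev_word}"]

    # Dictionary enrichment: one feature per tag the lexicon allows.
    tags = lexicon.get(word)
    if tags is None:
        feats.append("dict=OOV")  # word is not covered by the dictionary
    else:
        feats.extend(f"dict_tag={t}" for t in sorted(tags))
        if len(tags) == 1:
            feats.append("dict_unambiguous")
    return feats


if __name__ == "__main__":
    sentence = ["The", "can", "rusted"]
    for idx in range(len(sentence)):
        print(sentence[idx], token_features(sentence, idx, TOY_LEXICON))
```

In a sketch like this, the dictionary features give the learner a prior over plausible tags for each word, which is most useful for words that are rare or absent in the training corpus.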



Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nguyen, KH., Ock, CY. (2012). Using Wiktionary to Improve Lexical Disambiguation in Multiple Languages. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28604-9_20

  • DOI: https://doi.org/10.1007/978-3-642-28604-9_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28603-2

  • Online ISBN: 978-3-642-28604-9

  • eBook Packages: Computer Science, Computer Science (R0)