Using Tagged and Untagged Corpora to Improve Thai Morphological Analysis with Unknown Word Boundary Detections

  • Conference paper
PRICAI 2012: Trends in Artificial Intelligence (PRICAI 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7458)

Abstract

Word boundary ambiguity is a major problem in Thai morphological analysis because Thai words are written consecutively, with no word delimiters. However, the part-of-speech (POS) tagged corpus that has been used so far was constructed from academic papers, and no prior research has addressed documents written in informal language. This paper presents Thai morphological analysis with unknown word boundary detection using both POS-tagged and untagged corpora. The Viterbi algorithm and the Maximum Entropy (ME)-Viterbi algorithm are employed separately to evaluate our methods. The unknown word problem is handled by using string length to estimate word boundaries. Experiments are performed on documents written in formal language and on documents written in informal language. The results show that the proposed method, which uses an untagged corpus in addition to a tagged corpus, is effective for text written in informal language.
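The abstract gives no model details, so the following is only a minimal sketch of the general idea: a Viterbi search over all segmentations of an undelimited string, where spans missing from the lexicon receive a score that depends only on their length. It uses a plain unigram model rather than the POS-tagged model of the paper, and the toy lexicon, the constants `MAX_WORD_LEN`, `UNKNOWN_BASE`, and `UNKNOWN_DECAY`, and the function names are all hypothetical choices made for illustration.

```python
import math

# Hypothetical toy lexicon: word -> unigram probability (not from the paper).
LEXICON = {"the": 0.3, "cat": 0.2, "sat": 0.2, "on": 0.2, "mat": 0.1}

MAX_WORD_LEN = 6      # longest candidate span considered (assumption)
UNKNOWN_BASE = 1e-4   # assumed probability of a one-character unknown word
UNKNOWN_DECAY = 0.1   # assumed decay factor per additional unknown character


def word_log_prob(span):
    """Score a candidate word; unknown spans are scored by their length only."""
    if span in LEXICON:
        return math.log(LEXICON[span])
    return math.log(UNKNOWN_BASE) + (len(span) - 1) * math.log(UNKNOWN_DECAY)


def segment(text):
    """Return the highest-scoring segmentation of an undelimited string."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-score of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start] + word_log_prob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("thecatsatonthemat"))
print(segment("thecatzzsat"))
```

With these assumed constants, an unknown stretch such as `zz` is kept as a single span, because one longer unknown word is penalized less than two separate one-character unknown words. Latin-script strings stand in for Thai text here purely for readability.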

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Luangpiensamut, W., Komiya, K., Kotani, Y. (2012). Using Tagged and Untagged Corpora to Improve Thai Morphological Analysis with Unknown Word Boundary Detections. In: Anthony, P., Ishizuka, M., Lukose, D. (eds) PRICAI 2012: Trends in Artificial Intelligence. PRICAI 2012. Lecture Notes in Computer Science, vol 7458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32695-0_69

  • DOI: https://doi.org/10.1007/978-3-642-32695-0_69

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-32694-3

  • Online ISBN: 978-3-642-32695-0

  • eBook Packages: Computer Science, Computer Science (R0)