Abstract
The main aim of this work is to implement a stochastic Korean word-spacing system that is equally robust on inner data and external data. Word spacing in Korean strongly influences semantic and syntactic scope. To cope with the various problems that word-spacing errors cause when processing Korean text, this study (a) presents a simple stochastic word-spacing system with only two parameters, using relative word-unigram frequencies and the odds favoring the inner-spacing probability of disyllables located at the boundary of stochastic-based words; (b) reduces dependency on the training data by dynamically creating a candidate-word list with the longest-radix-selecting algorithm; and (c) removes noise from the training data by refining the training procedure. The system thus becomes robust against unseen words and performs similarly on both kinds of data: it obtained 98.35% precision in word-unit correction on the inner test data and 97.47% on the external test data.
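As a rough illustration of point (a), the sketch below scores each candidate spacing by the log relative unigram frequency of its words plus a weighted log-odds term for a space between the two syllables at each word boundary, and picks the best one with a Viterbi-style search. It is not the authors' implementation: the abstract does not give the exact formula or the two parameters, so the scoring function, the `odds_weight` parameter, and all counts and probabilities here are assumptions made up for the example. The longest-radix handling of unseen words from point (b) is not covered.

```python
import math

# Toy counts, invented for illustration only (not from the paper's corpus).
WORD_COUNTS = {"아버지가": 5, "아버지": 6, "방에": 4, "가방에": 2, "들어가신다": 3}
# Hypothetical P(space between the two syllables) for boundary disyllables.
SPACE_PROB = {("가", "방"): 0.7, ("지", "가"): 0.2, ("에", "들"): 0.9}

TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in WORD_COUNTS)


def word_logp(word):
    """Log relative unigram frequency of a known word."""
    return math.log(WORD_COUNTS[word] / TOTAL)


def boundary_log_odds(left, right, default=0.5):
    """Log odds that a space separates syllables `left` and `right`."""
    p = SPACE_PROB.get((left, right), default)
    return math.log(p / (1.0 - p))


def space_words(text, odds_weight=1.0):
    """Return the best-scoring spacing of `text`, or None if none exists."""
    n = len(text)
    # best[i] = (score of best spacing of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word not in WORD_COUNTS or best[start][0] == -math.inf:
                continue
            score = best[start][0] + word_logp(word)
            if start > 0:  # add the boundary term between adjacent words
                score += odds_weight * boundary_log_odds(text[start - 1], text[start])
            if score > best[end][0]:
                best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no spacing using only known words
    words, end = [], n
    while end > 0:  # walk the back-pointers to recover the words
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return " ".join(reversed(words))


print(space_words("아버지가방에들어가신다"))
```

With these toy numbers the ambiguous string is spaced as "아버지가 방에 들어가신다" rather than "아버지 가방에 들어가신다". A string with no known words returns `None`, which is the kind of training-data dependency that the dynamically expanded candidate list is meant to reduce.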
© 2005 Springer-Verlag Berlin Heidelberg
Kang, M.Y., Choi, S.J., Yoon, A.S., Kwon, H.C. (2005). Korean Stochastic Word-Spacing with Dynamic Expansion of Candidate Words List. In: Su, K.Y., Tsujii, J., Lee, J.H., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2004. Lecture Notes in Computer Science, vol 3248. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30211-7_31
DOI: https://doi.org/10.1007/978-3-540-30211-7_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24475-2
Online ISBN: 978-3-540-30211-7
eBook Packages: Computer Science, Computer Science (R0)