A Local Generative Model for Chinese Word Segmentation

Zhang, Kaixu; Sun, Maosong; Xue, Ping

doi:10.1007/978-3-642-17187-1_41


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6458)

Included in the following conference series:

Asia Information Retrieval Symposium


Abstract

This paper presents a local generative model for Chinese word segmentation. The model learns faster than discriminative models and supports unsupervised learning. It can also make use of larger resources. In this model, four successive characters are used to determine whether a character interval is a word boundary. For unsupervised learning, the Gibbs sampling algorithm is applied together with three additional rules. Besides words, the word candidates generated by the model can improve the performance of Chinese information retrieval. Experiments show that, in supervised learning, the method outperforms a language-model-based method, and its performance on one corpus is better than the best result reported in the SIGHAN Bakeoff 2005. In unsupervised learning, the method achieves performance comparable to that of the state-of-the-art method.
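
The idea of deciding each character interval from four successive characters can be illustrated with a minimal sketch. This is not the authors' formulation, which the abstract does not spell out. It assumes a simple count-based estimate of the joint frequency of a four-character window and a boundary label, with additive smoothing. The names `PAD`, `contexts`, `train`, `boundary_prob`, and `segment`, and the parameters `alpha` and `threshold`, are all hypothetical.

```python
from collections import Counter

PAD = "#"  # hypothetical padding symbol for sentence edges (an assumption)


def contexts(chars):
    """Yield (i, window) for each interval between chars[i-1] and chars[i].

    The window holds the two characters on each side of the interval.
    """
    padded = PAD * 2 + chars + PAD * 2
    for i in range(1, len(chars)):
        yield i, padded[i:i + 4]


def train(segmented_sentences):
    """Count (window, is_boundary) pairs from sentences given as word lists."""
    counts = Counter()
    for words in segmented_sentences:
        chars = "".join(words)
        cuts, pos = set(), 0
        for word in words[:-1]:
            pos += len(word)
            cuts.add(pos)  # interval index where a word ends
        for i, window in contexts(chars):
            counts[(window, i in cuts)] += 1
    return counts


def boundary_prob(counts, window, alpha=1.0):
    """Smoothed estimate of P(boundary | window) from the joint counts."""
    yes = counts[(window, True)] + alpha
    no = counts[(window, False)] + alpha
    return yes / (yes + no)


def segment(counts, chars, threshold=0.5):
    """Cut at every interval whose estimated boundary probability is high."""
    words, start = [], 0
    for i, window in contexts(chars):
        if boundary_prob(counts, window) > threshold:
            words.append(chars[start:i])
            start = i
    words.append(chars[start:])
    return words
```

Each interval is scored independently from local counts, so training in this sketch is a single counting pass over the corpus. A window never seen in training scores exactly 0.5 and is left unsegmented under the default threshold. The unsupervised setting described in the abstract (Gibbs sampling plus three rules) is not covered here.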



Author information

Authors and Affiliations

State Key Lab of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, P.R. China
Kaixu Zhang & Maosong Sun
The Boeing Company, USA
Ping Xue


Editor information

Editors and Affiliations

Department of Computer Science and Information Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, 10617, Taiwan R.O.C.
Pu-Jen Cheng
School of Computing, National University of Singapore (NUS), Computing 1, 13 Computing Drive, 117417, Singapore
Min-Yen Kan
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong, China
Wai Lam
School of Computing, Computing 1, National University of Singapore (NUS), 13 Computing Drive, 117417, Singapore
Preslav Nakov


About this paper

Cite this paper

Zhang, K., Sun, M., Xue, P. (2010). A Local Generative Model for Chinese Word Segmentation. In: Cheng, PJ., Kan, MY., Lam, W., Nakov, P. (eds) Information Retrieval Technology. AIRS 2010. Lecture Notes in Computer Science, vol 6458. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17187-1_41


DOI: https://doi.org/10.1007/978-3-642-17187-1_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17186-4
Online ISBN: 978-3-642-17187-1
eBook Packages: Computer Science, Computer Science (R0)
