A Study of Chinese Word Segmentation Based on the Characteristics of Chinese

  • Conference paper
Language Processing and Knowledge in the Web

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8105)

Abstract

This paper presents research on Chinese word segmentation (CWS). Segmenting Chinese text is difficult because written Chinese marks no word boundaries and because several kinds of ambiguity can yield different segmentations of the same expression. Conventional research usually emphasizes the algorithms employed and the workflow designed, and contributes less to the discussion of the fundamental problems of CWS. In contrast, this paper first analyzes the characteristics of Chinese and several categories of ambiguity in Chinese in order to explore potential solutions. The selected conditional random field models are trained with a quasi-Newton algorithm to perform the sequence labeling. To capture as much contextual information as possible, an augmented and optimized set of features is developed. The experiments show promising evaluation scores compared with some related work.
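To make the sequence-labeling framing concrete, the following is a minimal sketch, not the paper's implementation. It assumes the common B/M/E/S character tag scheme (begin, middle, end, single) and a hypothetical feature template of character unigrams and bigrams in a one-character window; the abstract specifies neither the tag set nor the features, so the function names and the template here are illustrative only. The well-known string 研究生命, which can be read as 研究/生命 or 研究生/命, shows how one character sequence admits two tag sequences.

```python
from typing import Dict, List


def words_to_tags(words: List[str]) -> List[str]:
    """Map a segmented sentence to one B/M/E/S tag per character (assumed scheme)."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their B/M/E/S tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that stops mid-word
        words.append(current)
    return words


def char_features(chars: str, i: int, window: int = 1) -> Dict[str, str]:
    """Context features for position i: hypothetical unigram and bigram template."""
    padded = ["<s>"] * window + list(chars) + ["</s>"] * window
    center = i + window
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):  # character unigrams
        feats[f"C{offset}"] = padded[center + offset]
    for offset in range(-window, window):  # adjacent character bigrams
        left = padded[center + offset]
        right = padded[center + offset + 1]
        feats[f"C{offset}C{offset + 1}"] = left + right
    return feats


if __name__ == "__main__":
    # Two readings of the same characters give two different tag sequences.
    print(words_to_tags(["研究", "生命"]))
    print(words_to_tags(["研究生", "命"]))
    print(char_features("研究生命", 2))
```

A trained sequence model such as a CRF would score tag sequences using features of this kind, and `tags_to_words` would turn the best-scoring sequence back into a segmentation.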

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Han, A.LF., Wong, D.F., Chao, L.S., He, L., Zhu, L., Li, S. (2013). A Study of Chinese Word Segmentation Based on the Characteristics of Chinese. In: Gurevych, I., Biemann, C., Zesch, T. (eds) Language Processing and Knowledge in the Web. Lecture Notes in Computer Science, vol 8105. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40722-2_12

  • DOI: https://doi.org/10.1007/978-3-642-40722-2_12

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40721-5

  • Online ISBN: 978-3-642-40722-2

  • eBook Packages: Computer Science, Computer Science (R0)