Automatically Build Corpora for Chinese Spelling Check Based on the Input Method

  • Conference paper
Natural Language Processing and Chinese Computing (NLPCC 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11838)

Abstract

Chinese Spelling Check (CSC) is an important task in Chinese language processing. A main challenge in applying supervised learning to CSC is the shortage of high-quality annotated corpora for building models. This paper proposes new approaches that automatically build CSC corpora based on the input method. We build two corpora: the p-corpus, used to check errors in text produced by the Pinyin input method, and the v-corpus, used to check errors in text produced by the voice input method. The p-corpus is constructed using two methods: one based on conversion between Chinese characters and their sounds, and the other based on Automatic Speech Recognition (ASR). The v-corpus is constructed based on ASR. We use misspelled sentences from real language use as the test set. Experimental results demonstrate that our corpora yield better checking performance than the benchmark corpus.

Notes

  1. Pinyin is the standard romanization used to annotate Chinese pronunciation. https://en.wikipedia.org/wiki/Pinyin.

  2. Chinese tones are numbered from 1 to 4.

  3. According to [6], a sound edit distance of 1 covers about 90% of spelling errors, and a sound edit distance of 2 accounts for almost all of the remaining ones. We therefore treat two characters whose sound edit distance is 1 or 2 as similar characters. For example, the characters pronounced zhen4 ("shock") and zheng4 ("positive") have a sound edit distance of 1; hence, they are similar characters.

  4. According to statistics, there are 11 groups of fuzzy sounds in Chinese: z-zh, c-ch, s-sh, l-n, f-h, r-l, an-ang, en-eng, in-ing, ian-iang, uan-uang.

  5. https://catalog.ldc.upenn.edu/LDC2011T13. The articles in this corpus have undergone a rigorous editing process and are considered to be entirely correct.

  6. http://www.openslr.org/resources/33/data_aishell. This speech corpus was transcribed by professional voice proofreaders and passed strict quality inspection. The accuracy of AISHELL is above 95%.

  7. The word segmentation tool used in this paper is jieba. https://github.com/fxsjy/jieba.

  8. This tool (python-pinyin) extracts the sounds of Chinese characters. https://github.com/mozillazg/python-pinyin.

  9. This tool (Pinyin2Hanzi) converts sounds into Chinese characters. https://github.com/letiantian/Pinyin2Hanzi.

  10. The score is calculated based on the HMM principle. In general, the more commonly used a word is, the higher its score. https://github.com/letiantian/Pinyin2Hanzi.

  11. https://github.com/baidubce/pie/tree/master.

  12. A speech recognition toolkit. https://github.com/kaldi-asr/kaldi.
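
Notes 3 and 4 can be illustrated with a small sketch. It is not the authors' implementation. It assumes that sound edit distance is the Levenshtein distance between tone-numbered pinyin strings (such as zhen4), and that a fuzzy-sound variant is formed by swapping a toneless syllable's initial or final with its partner from the 11 groups in Note 4. The function names and the list of initials are illustrative choices, and the generated variants are not checked against the set of valid Mandarin syllables.

```python
# Illustrative sketch only: names and matching rules are assumptions,
# not taken from the paper.

# The 11 fuzzy-sound groups from Note 4, split into initials and finals.
FUZZY_INITIALS = [("z", "zh"), ("c", "ch"), ("s", "sh"),
                  ("l", "n"), ("f", "h"), ("r", "l")]
FUZZY_FINALS = [("an", "ang"), ("en", "eng"), ("in", "ing"),
                ("ian", "iang"), ("uan", "uang")]

# Pinyin initials; two-letter initials come first so "zh" wins over "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def sound_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two pinyin strings (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_similar(a: str, b: str) -> bool:
    """Follow Note 3 literally: distance 1 or 2 counts as similar."""
    return 1 <= sound_edit_distance(a, b) <= 2


def split_syllable(syllable: str) -> tuple[str, str]:
    """Split a toneless syllable into (initial, final)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable  # zero-initial syllable such as "an"


def _partners(unit: str, pairs: list[tuple[str, str]]) -> set[str]:
    """Return every unit paired with `unit` in the given fuzzy groups."""
    found = set()
    for left, right in pairs:
        if unit == left:
            found.add(right)
        elif unit == right:
            found.add(left)
    return found


def fuzzy_variants(syllable: str) -> set[str]:
    """Syllables reachable by swapping one initial or one final."""
    initial, final = split_syllable(syllable)
    variants = {p + final for p in _partners(initial, FUZZY_INITIALS)}
    variants |= {initial + p for p in _partners(final, FUZZY_FINALS)}
    return variants
```

Under these assumptions, `sound_edit_distance("zhen4", "zheng4")` returns 1, matching the example in Note 3, and `fuzzy_variants("zhen")` returns `{"zen", "zheng"}`.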

References

  1. Amodei, D., et al.: End to end speech recognition in English and Mandarin (2016)

  2. Bu, H., Du, J., Na, X., Wu, B., Zheng, H.: AISHELL-1: an open-source Mandarin speech corpus and a speech recognition baseline. In: 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pp. 1–5. IEEE (2017)

  3. Chang, T.H., Chen, H.C., Tseng, Y.H., Zheng, J.L.: Automatic detection and correction for Chinese misspelled words using phonological and orthographic similarities. In: Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pp. 97–101 (2013)

  4. Chen, Y.Z., Wu, S.H., Yang, P.C., Ku, T., Chen, G.D.: Improve the detection of improperly used Chinese characters in students’ essays with error model. Int. J. Continuing Eng. Educ. Life Long Learn. 21(1), 103–116 (2011)

  5. Chen, Z., Lee, K.F.: A new statistical approach to Chinese pinyin input. In: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (2000)

  6. Hsieh, Y.M., Bai, M.H., Huang, S.L., Chen, K.J.: Correcting Chinese spelling errors with word lattice decoding. ACM Trans. Asian Low-Resour. Lang. Inf. Process. (TALLIP) 14(4), 18 (2015)

  7. Liu, C.L., Lai, M.H., Tien, K.W., Chuang, Y.H., Wu, S.H., Lee, C.Y.: Visually and phonologically similar characters in incorrect Chinese words: analyses, identification, and applications. ACM Trans. Asian Lang. Inf. Process. (TALIP) 10(2), 10 (2011)

  8. Liu, Y., Zan, H., Zhong, M., Ma, H.: Detecting simultaneously Chinese grammar errors based on a BiLSTM-CRF model. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 188–193 (2018)

  9. Povey, D., et al.: The Kaldi speech recognition toolkit. IEEE Signal Processing Society, Technical report (2011)

  10. Sak, H., Senior, A., Rao, K., Beaufays, F.: Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv preprint arXiv:1507.06947 (2015)

  11. Wang, D., Fung, G.P.C., Debosschere, M., Dong, S., Zhu, J., Wong, K.F.: A new benchmark and evaluation schema for Chinese typo detection and correction. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)

  12. Wang, D., Song, Y., Li, J., Han, J., Zhang, H.: A hybrid approach to automatic corpus generation for Chinese spelling check. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2517–2527 (2018)

  13. Wu, S.H., Liu, C.L., Lee, L.H.: Chinese spelling check evaluation at SIGHAN bake-off 2013. In: Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pp. 35–42 (2013)

  14. Yang, S., Zhao, H., Wang, X., Lu, B.L.: Spell checking for Chinese. In: LREC, pp. 730–736 (2012)

  15. Yongwei, Z., Qinan, H., Fang, L., Yueguo, G.: CMMC-BDRC solution to the NLP-TEA-2018 Chinese grammatical error diagnosis task. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 180–187 (2018)

  16. Yu, J., Li, Z.: Chinese spelling error detection and correction based on language model, pronunciation, and shape. In: Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pp. 220–223 (2014)

  17. Yu, L.C., Lee, L.H., Tseng, Y.H., Chen, H.H.: Overview of SIGHAN 2014 bake-off for Chinese spelling check. In: Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pp. 126–132 (2014)

  18. Zheng, Y., Li, C., Sun, M.: CHIME: an efficient error-tolerant Chinese pinyin input method. In: Twenty-Second International Joint Conference on Artificial Intelligence (2011)

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61672040), the Beijing Urban Governance Research Center, and the North China University of Technology Startup Fund. The corresponding author is Hao Wang.

Author information

Corresponding author

Correspondence to Hao Wang.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Duan, J., Pan, L., Wang, H., Zhang, M., Wu, M. (2019). Automatically Build Corpora for Chinese Spelling Check Based on the Input Method. In: Tang, J., Kan, MY., Zhao, D., Li, S., Zan, H. (eds) Natural Language Processing and Chinese Computing. NLPCC 2019. Lecture Notes in Computer Science, vol 11838. Springer, Cham. https://doi.org/10.1007/978-3-030-32233-5_37

  • DOI: https://doi.org/10.1007/978-3-030-32233-5_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-32232-8

  • Online ISBN: 978-3-030-32233-5

  • eBook Packages: Computer Science, Computer Science (R0)
