Automatically Build Corpora for Chinese Spelling Check Based on the Input Method

Duan, Jianyong; Pan, Lijian; Wang, Hao; Zhang, Mei; Wu, Mingli

doi:10.1007/978-3-030-32233-5_37

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11838)

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

Abstract

Chinese Spelling Check (CSC) is an important task in Chinese language processing. A main challenge in applying supervised learning to CSC is the shortage of high-quality annotated corpora for building models. This paper proposes new approaches that automatically build CSC corpora based on the input method. We build two corpora: the p-corpus, used to check errors in text generated by the Pinyin input method, and the v-corpus, used to check errors in text generated by the voice input method. The p-corpus is constructed using two methods: one is based on conversion between Chinese characters and their sounds, and the other is based on Automatic Speech Recognition (ASR). The v-corpus is constructed using ASR. We use misspelled sentences from real language use as the test set. Experimental results demonstrate that our corpora yield better checking performance than the benchmark corpus.
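
The character-to-sound route for the p-corpus can be pictured with a toy sketch. This is not the authors' pipeline: the two small tables below are hypothetical, hand-made stand-ins for a real character-to-pinyin tool and a real pinyin-to-character converter, and the function name `make_noisy` is invented for illustration. The sketch only shows the general idea that a correct sentence is mapped to its sounds, the sounds are mapped back to characters, and any position where a different same-sound character comes back is recorded as a simulated Pinyin input error.

```python
# Toy sketch: simulate Pinyin-input errors with a
# character -> sound -> character round trip.
# All tables are hypothetical examples, not data from the paper.

# Hypothetical character-to-pinyin table (tone number appended).
CHAR_TO_PINYIN = {"在": "zai4", "再": "zai4", "见": "jian4", "件": "jian4"}

# Hypothetical sound-to-character table. The first candidate stands in for
# the top-ranked output of a pinyin-to-character converter.
PINYIN_TO_CHARS = {"zai4": ["在", "再"], "jian4": ["见", "件"]}


def make_noisy(sentence):
    """Return (noisy_sentence, error_positions) for a correct sentence."""
    noisy = []
    errors = []
    for i, ch in enumerate(sentence):
        sound = CHAR_TO_PINYIN.get(ch)
        # Characters outside the toy tables are copied unchanged.
        candidates = PINYIN_TO_CHARS.get(sound, [ch])
        picked = candidates[0]
        if picked != ch:
            errors.append(i)
        noisy.append(picked)
    return "".join(noisy), errors


noisy, errors = make_noisy("再见")
print(noisy, errors)  # 在见 [0]
```

The pair of the original sentence and the noisy sentence, together with the error positions, is the kind of annotated example a supervised CSC model could be trained on.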

Notes

1. Pinyin is a notation for Chinese pronunciation. https://en.wikipedia.org/wiki/Pinyin.
2. Chinese tones range from 1 to 4.
3. According to [6], a sound edit distance of 1 covers about 90% of spelling errors, and a sound edit distance of 2 accounts for almost all of the remaining spelling errors. We therefore treat two characters whose sound edit distance is 1 or 2 as similar characters. For example, the characters pronounced zhen4 (“shock”) and zheng4 (“positive”) have a sound edit distance of 1; hence, they are similar characters.
4. According to statistics, there are 11 groups of fuzzy sounds in Chinese characters: z-zh, c-ch, s-sh, l-n, f-h, r-l, an-ang, en-eng, in-ing, ian-iang, uan-uang.
5. https://catalog.ldc.upenn.edu/LDC2011T13. The articles in this corpus have undergone a rigorous editing process and are considered to be entirely correct.
6. http://www.openslr.org/resources/33/data_aishell. The speech in this library was transcribed by professional voice proofreaders and passed strict quality inspection. The accuracy of AISHELL is above 95%.
7. The word segmentation tool used in this paper is jieba. https://github.com/fxsjy/jieba.
8. This tool can extract the sounds of Chinese characters. https://github.com/mozillazg/python-pinyin.
9. This tool can convert sounds into Chinese characters. https://github.com/letiantian/Pinyin2Hanzi.
10. The score is calculated based on the HMM principle. In general, the more commonly used a word is, the higher its score. https://github.com/letiantian/Pinyin2Hanzi.
11. https://github.com/baidubce/pie/tree/master.
12. A speech recognition kit. https://github.com/kaldi-asr/kaldi.
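
The similarity test in note 3 and the fuzzy-sound groups in note 4 can be sketched as follows. The notes do not spell out how the sound edit distance is computed, so this sketch assumes it is the ordinary Levenshtein distance over pinyin strings written with a trailing tone digit (so a tone change counts as one substitution). The function names `sound_edit_distance`, `is_similar`, and `is_fuzzy_pair` are illustrative, and the fuzzy check works on toneless syllables by simple prefix and suffix swaps.

```python
# Sketch of the sound edit distance (note 3) and fuzzy sounds (note 4).
# Assumption: the distance is plain Levenshtein distance over pinyin
# strings such as "zhen4"; the paper may define it differently.

FUZZY_INITIALS = [("z", "zh"), ("c", "ch"), ("s", "sh"),
                  ("l", "n"), ("f", "h"), ("r", "l")]
FUZZY_FINALS = [("an", "ang"), ("en", "eng"), ("in", "ing"),
                ("ian", "iang"), ("uan", "uang")]


def sound_edit_distance(a, b):
    """Levenshtein distance between two pinyin strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_similar(a, b):
    """Similar characters per note 3: sound edit distance of 1 or 2."""
    return 1 <= sound_edit_distance(a, b) <= 2


def is_fuzzy_pair(a, b):
    """True if toneless syllables a and b differ by one fuzzy-sound swap."""
    for x, y in FUZZY_INITIALS:
        for src, dst in ((x, y), (y, x)):
            if a.startswith(src) and dst + a[len(src):] == b:
                return True
    for x, y in FUZZY_FINALS:
        for src, dst in ((x, y), (y, x)):
            if a.endswith(src) and a[:len(a) - len(src)] + dst == b:
                return True
    return False


print(sound_edit_distance("zhen4", "zheng4"))  # 1
print(is_similar("zhen4", "zheng4"))           # True
print(is_fuzzy_pair("zhen", "zheng"))          # True (en-eng group)
```

Under these assumptions, the zhen4/zheng4 example from note 3 has distance 1, and the same pair also falls into the en-eng fuzzy group from note 4.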

References

1. Amodei, D., et al.: End to end speech recognition in English and Mandarin (2016)
2. Bu, H., Du, J., Na, X., Wu, B., Zheng, H.: AISHELL-1: an open-source Mandarin speech corpus and a speech recognition baseline. In: 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), pp. 1–5. IEEE (2017)
3. Chang, T.H., Chen, H.C., Tseng, Y.H., Zheng, J.L.: Automatic detection and correction for Chinese misspelled words using phonological and orthographic similarities. In: Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pp. 97–101 (2013)
4. Chen, Y.Z., Wu, S.H., Yang, P.C., Ku, T., Chen, G.D.: Improve the detection of improperly used Chinese characters in students’ essays with error model. Int. J. Continuing Eng. Educ. Life Long Learn. 21(1), 103–116 (2011)
5. Chen, Z., Lee, K.F.: A new statistical approach to Chinese pinyin input. In: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (2000)
6. Hsieh, Y.M., Bai, M.H., Huang, S.L., Chen, K.J.: Correcting Chinese spelling errors with word lattice decoding. ACM Trans. Asian Low-Resour. Lang. Inf. Process. (TALLIP) 14(4), 18 (2015)
7. Liu, C.L., Lai, M.H., Tien, K.W., Chuang, Y.H., Wu, S.H., Lee, C.Y.: Visually and phonologically similar characters in incorrect Chinese words: analyses, identification, and applications. ACM Trans. Asian Lang. Inf. Process. (TALIP) 10(2), 10 (2011)
8. Liu, Y., Zan, H., Zhong, M., Ma, H.: Detecting simultaneously Chinese grammar errors based on a BiLSTM-CRF model. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 188–193 (2018)
9. Povey, D., et al.: The Kaldi speech recognition toolkit. IEEE Signal Processing Society, Technical report (2011)
10. Sak, H., Senior, A., Rao, K., Beaufays, F.: Fast and accurate recurrent neural network acoustic models for speech recognition. arXiv preprint arXiv:1507.06947 (2015)
11. Wang, D., Fung, G.P.C., Debosschere, M., Dong, S., Zhu, J., Wong, K.F.: A new benchmark and evaluation schema for Chinese typo detection and correction. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
12. Wang, D., Song, Y., Li, J., Han, J., Zhang, H.: A hybrid approach to automatic corpus generation for Chinese spelling check. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2517–2527 (2018)
13. Wu, S.H., Liu, C.L., Lee, L.H.: Chinese spelling check evaluation at SIGHAN bake-off 2013. In: Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing, pp. 35–42 (2013)
14. Yang, S., Zhao, H., Wang, X., Lu, B.L.: Spell checking for Chinese. In: LREC, pp. 730–736 (2012)
15. Yongwei, Z., Qinan, H., Fang, L., Yueguo, G.: CMMC-BDRC solution to the NLP-TEA-2018 Chinese grammatical error diagnosis task. In: Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 180–187 (2018)
16. Yu, J., Li, Z.: Chinese spelling error detection and correction based on language model, pronunciation, and shape. In: Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pp. 220–223 (2014)
17. Yu, L.C., Lee, L.H., Tseng, Y.H., Chen, H.H.: Overview of SIGHAN 2014 bake-off for Chinese spelling check. In: Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pp. 126–132 (2014)
18. Zheng, Y., Li, C., Sun, M.: CHIME: an efficient error-tolerant Chinese pinyin input method. In: Twenty-Second International Joint Conference on Artificial Intelligence (2011)

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61672040), Beijing Urban Governance Research Center and the North China University of Technology Startup Fund. The corresponding author is Hao Wang.

Author information

Authors and Affiliations

North China University of Technology, Beijing, China
Jianyong Duan, Lijian Pan, Hao Wang, Mei Zhang & Mingli Wu
CNONIX National Standard Application and Promotion Lab, Beijing, China
Jianyong Duan, Lijian Pan & Hao Wang

Corresponding author

Correspondence to Hao Wang.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Jie Tang
National University of Singapore, Singapore, Singapore
Min-Yen Kan
Peking University, Beijing, China
Dongyan Zhao
Peking University, Beijing, China
Sujian Li
Zhengzhou University, Zhengzhou, China
Hongying Zan

About this paper

Cite this paper

Duan, J., Pan, L., Wang, H., Zhang, M., Wu, M. (2019). Automatically Build Corpora for Chinese Spelling Check Based on the Input Method. In: Tang, J., Kan, MY., Zhao, D., Li, S., Zan, H. (eds) Natural Language Processing and Chinese Computing. NLPCC 2019. Lecture Notes in Computer Science, vol 11838. Springer, Cham. https://doi.org/10.1007/978-3-030-32233-5_37

DOI: https://doi.org/10.1007/978-3-030-32233-5_37
Published: 30 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32232-8
Online ISBN: 978-3-030-32233-5
eBook Packages: Computer Science, Computer Science (R0)
