The WikEd Error Corpus: A Corpus of Corrective Wikipedia Edits and Its Application to Grammatical Error Correction

Grundkiewicz, Roman; Junczys-Dowmunt, Marcin

doi:10.1007/978-3-319-10888-9_47

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8686)

Included in the following conference series:

International Conference on Natural Language Processing

Abstract

This paper introduces the freely available WikEd Error Corpus. We describe the data mining process from Wikipedia revision histories, corpus content and format. The corpus consists of more than 12 million sentences with a total of 14 million edits of various types. As one possible application, we show that WikEd can be successfully adapted to improve a strong baseline in a task of grammatical error correction for English-as-a-Second-Language (ESL) learners’ writings by 2.63%. Used together with an ESL error corpus, a composed system gains 1.64% when compared to the ESL-trained system.

Author information

Authors and Affiliations

Faculty of Mathematics and Computer Science, Adam Mickiewicz University in Poznań, ul. Umultowska 87, 61-614, Poznań, Poland
Roman Grundkiewicz & Marcin Junczys-Dowmunt

Editor information

Editors and Affiliations

Institute of Computer Science, Polish Academy of Sciences, ul. Jana Kazimierza 5, 01-248, Warsaw, Poland
Adam Przepiórkowski & Maciej Ogrodniczuk

About this paper

Cite this paper

Grundkiewicz, R., Junczys-Dowmunt, M. (2014). The WikEd Error Corpus: A Corpus of Corrective Wikipedia Edits and Its Application to Grammatical Error Correction. In: Przepiórkowski, A., Ogrodniczuk, M. (eds) Advances in Natural Language Processing. NLP 2014. Lecture Notes in Computer Science, vol 8686. Springer, Cham. https://doi.org/10.1007/978-3-319-10888-9_47

DOI: https://doi.org/10.1007/978-3-319-10888-9_47
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10887-2
Online ISBN: 978-3-319-10888-9
eBook Packages: Computer Science, Computer Science (R0)