Abstract
We present a machine learning system that identifies erroneous entries in a translation memory. It learns to classify entries according to either a strict or a permissive view of correctness. It is trained on features relating to segment length, translation quality checks, and spelling and grammar errors, and additionally uses external data to detect problems with fluency and lexical choice.
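As a rough illustration of the segment-length features mentioned above, the following minimal sketch computes a few length-based values for one source/target pair. The function name `length_features`, the feature names, and the sample segments are assumptions made for this example; they are not taken from the system described here, which may define its features differently.

```python
def length_features(source: str, target: str) -> dict:
    """Compute simple length-based features for one translation unit.

    Hypothetical helper for illustration only; the feature set is assumed.
    """
    src_chars, tgt_chars = len(source), len(target)
    src_words, tgt_words = len(source.split()), len(target.split())
    return {
        "src_chars": src_chars,
        "tgt_chars": tgt_chars,
        # Target-to-source length ratios; 0.0 when the source is empty.
        "char_ratio": tgt_chars / src_chars if src_chars else 0.0,
        "word_ratio": tgt_words / src_words if src_words else 0.0,
        # An empty or whitespace-only target is a strong error signal.
        "target_empty": not target.strip(),
    }


# Made-up example pair, not drawn from any real translation memory.
features = length_features("Save the file", "Stoor die lêer")
```

A dictionary like this could be turned into a feature vector and passed to any standard classifier, together with features from quality checks and spelling or grammar checkers.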
Notes
For full details, see Barbu et al. (2016).
Personal correspondence with Barbu.
We quote the test names as they are used in the documentation of pofilter.
This is the only classifier for which setting the random seed cannot make the runs fully reproducible. See http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.
References
Barbu E (2015) Spotting false translation segments in translation memories. In: Proceedings of the Workshop Natural Language Processing for Translation Memories, Association for Computational Linguistics, Hissar, Bulgaria, pp 9–16, http://www.aclweb.org/anthology/W15-5202
Barbu E, Parra Escartín C, Bentivogli L, Negri M, Turchi M, Federico M, Mastrostefano L, Orăsan C (2016) 1st shared task on automatic translation memory cleaning preparation and lessons learned. In: 2nd Workshop on Natural Language Processing for Translation Memories (NLP4TM 2016), Portorož, Slovenia, LREC 2016, http://www.lrec-conf.org/proceedings/lrec2016/workshops/LREC2016Workshop-NLP4TM_Proceedings.pdf
Dougherty G (2013) Pattern Recognition and Classification: An Introduction. Springer. doi:10.1007/978-1-4614-5323-9
Dyer C, Chahuneau V, Smith NA (2013) A simple, fast, and effective reparameterization of IBM model 2. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Atlanta, Georgia, pp 644–648, http://www.aclweb.org/anthology/N13-1073
Gale WA, Church KW (1993) A program for aligning sentences in bilingual corpora. Computational Linguistics 19(1):75–102, http://dl.acm.org/citation.cfm?id=972450.972455
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. Journal of Machine Learning Research 3:1157–1182, http://www.jmlr.org/papers/v3/guyon03a.html
Heafield K, Pouzyrevsky I, Clark JH, Koehn P (2013) Scalable modified Kneser-Ney language model estimation. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, pp 690–696, http://kheafield.com/professional/edinburgh/estimate_paper.pdf
Koehn P (2005) Europarl: A parallel corpus for statistical machine translation. In: Conference Proceedings: the tenth Machine Translation Summit, AAMT, Phuket, Thailand, pp 79–86, http://mt-archive.info/MTS-2005-Koehn.pdf
Lagoudaki E (2006) Translation memories survey 2006: Users’ perceptions around TM use. In: Proceedings of the ASLIB International Conference Translating & the Computer, vol 28
Miłkowski M (2010) Developing an open-source, rule-based proofreading tool. Software: Practice and Experience 40(7):543–566, doi:10.1002/spe.971
O’Brien S (2007) Eye-tracking and translation memory matches. Perspectives: Studies in translatology 14(3):185–205
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830
Specia L, Paetzold G, Scarton C (2015) Multi-level translation quality prediction with QuEst++. In: Proceedings of ACL-IJCNLP 2015 System Demonstrations, Beijing, China, pp 115–120, http://www.aclweb.org/anthology/P15-4020
Tiedemann J (2011) Bitext alignment. Synth Lect Hum Lang Technol. doi:10.2200/S00367ED1V01Y201106HLT014
Zariņa I, Ņikiforovs P, Skadiņš R (2015) Word alignment based parallel corpora evaluation and cleaning using machine learning techniques. In: El-Kahlout ID, Özkan M, Sánchez-Martínez F, Ramírez-Sánchez G, Hollowood F, Way A (eds) Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, pp 185–192, http://aclweb.org/anthology/W15-4924
Acknowledgements
This research was supported by the Academy of African Languages and Science Strategic Project of the University of South Africa. The author thanks the anonymous reviewers for valuable feedback.
Cite this article
Wolff, F. Combining off-the-shelf components to clean a translation memory. Machine Translation 30, 167–181 (2016). https://doi.org/10.1007/s10590-016-9186-7
DOI: https://doi.org/10.1007/s10590-016-9186-7