Combining off-the-shelf components to clean a translation memory

Machine Translation

Abstract

We present a system to identify erroneous entries in a translation memory. It is a machine learning system that learns to classify entries according to either a strict or a permissive view on correctness. It is trained on features relating to segment length, translation quality checks, spelling and grammar errors, and additionally uses external data for detecting problems with fluency and lexical choice.
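The segment-length features mentioned above can be illustrated with a minimal sketch. The function name `length_features` and the feature names are hypothetical and chosen only for illustration; they are not taken from the article, whose actual feature set is not shown here.

```python
def length_features(source: str, target: str) -> dict:
    """Compute simple length-based features for one translation memory entry.

    Illustrative only: the feature names are assumptions, not the
    feature set used in the article.
    """
    src_chars, tgt_chars = len(source), len(target)
    src_words, tgt_words = len(source.split()), len(target.split())
    longer = max(src_chars, tgt_chars)
    return {
        "src_chars": src_chars,
        "tgt_chars": tgt_chars,
        # Shorter length divided by longer length, in [0, 1];
        # defined as 1.0 when both segments are empty.
        "char_ratio": min(src_chars, tgt_chars) / longer if longer else 1.0,
        # Absolute difference in whitespace-separated token counts.
        "word_diff": abs(src_words - tgt_words),
    }


# Hypothetical entry: a very short target for a long source gives a low
# char_ratio, which a classifier could learn to treat as suspicious.
features = length_features("Click the button to save your changes.", "OK")
```

A dictionary like this could be turned into one row of a feature matrix and combined with the other feature groups (quality checks, spelling and grammar errors, fluency scores) before training a classifier.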

Notes

  1. For full details, see Barbu et al. (2016).

  2. Personal correspondence with Barbu.

  3. http://docs.translatehouse.org/projects/translate-toolkit/.

  4. https://github.com/hlt-mt/TMOP.

  5. http://docs.translatehouse.org/projects/translate-toolkit/en/latest/commands/pofilter_tests.html.

  6. We quote the test names as they are used in the documentation of pofilter.

  7. https://hunspell.github.io/.

  8. http://www.abisource.com/projects/enchant/.

  9. https://languagetool.org/.

  10. From https://www.languagetool.org/languages/.

  11. http://kheafield.com/code/kenlm/.

  12. https://github.com/clab/fast_align.

  13. This is the only classifier for which setting the random seed cannot make the runs fully reproducible. See http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.

References

  • Barbu E (2015) Spotting false translation segments in translation memories. In: Proceedings of the Workshop Natural Language Processing for Translation Memories, Association for Computational Linguistics, Hissar, Bulgaria, pp 9–16, http://www.aclweb.org/anthology/W15-5202

  • Barbu E, Parra Escartín C, Bentivogli L, Negri M, Turchi M, Federico M, Mastrostefano L, Orăsan C (2016) 1st shared task on automatic translation memory cleaning preparation and lessons learned. In: 2nd Workshop on Natural Language Processing for Translation Memories (NLP4TM 2016), Portorož, Slovenia, LREC 2016, http://www.lrec-conf.org/proceedings/lrec2016/workshops/LREC2016Workshop-NLP4TM_Proceedings.pdf

  • Dougherty G (2013) Pattern Recognition and Classification: An Introduction. Springer. doi:10.1007/978-1-4614-5323-9

  • Dyer C, Chahuneau V, Smith NA (2013) A simple, fast, and effective reparameterization of IBM model 2. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Atlanta, Georgia, pp 644–648, http://www.aclweb.org/anthology/N13-1073

  • Gale WA, Church KW (1993) A program for aligning sentences in bilingual corpora. Computational Linguistics 19(1):75–102, http://dl.acm.org/citation.cfm?id=972450.972455

  • Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. Journal of Machine Learning Research 3:1157–1182, http://www.jmlr.org/papers/v3/guyon03a.html

  • Heafield K, Pouzyrevsky I, Clark JH, Koehn P (2013) Scalable modified Kneser-Ney language model estimation. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, pp 690–696, http://kheafield.com/professional/edinburgh/estimate_paper.pdf

  • Koehn P (2005) Europarl: A parallel corpus for statistical machine translation. In: Conference Proceedings: the tenth Machine Translation Summit, AAMT, Phuket, Thailand, pp 79–86, http://mt-archive.info/MTS-2005-Koehn.pdf

  • Lagoudaki E (2006) Translation memories survey 2006: Users’ perceptions around TM use. In: Proceedings of the ASLIB International Conference Translating & the Computer, vol 28

  • Miłkowski M (2010) Developing an open-source, rule-based proofreading tool. Software: Practice and Experience 40(7):543–566, doi:10.1002/spe.971

  • O’Brien S (2007) Eye-tracking and translation memory matches. Perspectives: Studies in translatology 14(3):185–205

  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830

  • Specia L, Paetzold G, Scarton C (2015) Multi-level translation quality prediction with QuEst++. In: Proceedings of ACL-IJCNLP 2015 System Demonstrations, Beijing, China, pp 115–120, http://www.aclweb.org/anthology/P15-4020

  • Tiedemann J (2011) Bitext alignment. Synth Lect Hum Lang Technol. doi:10.2200/S00367ED1V01Y201106HLT014

  • Zariņa I, Ņikiforovs P, Skadiņš R (2015) Word alignment based parallel corpora evaluation and cleaning using machine learning techniques. In: El-Kahlout ID, Özkan M, Sánchez-Martínez F, Ramírez-Sánchez G, Hollowood F, Way A (eds) Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, pp 185–192, http://aclweb.org/anthology/W15-4924

Acknowledgements

This research was supported by the Academy of African Languages and Science Strategic Project of the University of South Africa. The author thanks the anonymous reviewers for valuable feedback.

Author information

Correspondence to Friedel Wolff.

About this article

Cite this article

Wolff, F. Combining off-the-shelf components to clean a translation memory. Machine Translation 30, 167–181 (2016). https://doi.org/10.1007/s10590-016-9186-7

  • DOI: https://doi.org/10.1007/s10590-016-9186-7