Combining off-the-shelf components to clean a translation memory

Wolff, Friedel

doi:10.1007/s10590-016-9186-7

Published: 02 February 2017

Volume 30, pages 167–181 (2016)

Machine Translation

Friedel Wolff (ORCID: orcid.org/0000-0002-6615-5780)

Abstract

We present a system to identify erroneous entries in a translation memory. It is a machine learning system that learns to classify entries according to either a strict or a permissive view on correctness. It is trained on features relating to segment length, translation quality checks, spelling and grammar errors, and additionally uses external data for detecting problems with fluency and lexical choice.
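
To make the feature-based setup concrete, the following is a minimal sketch of how length-related features might be derived from a single translation memory entry before being passed to a classifier. It is not the author's implementation: the `TMEntry` class, the `length_features` function and the specific features (character counts, a shorter-to-longer length ratio, a word-count difference) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TMEntry:
    """A translation memory entry: a source segment and its translation."""
    source: str
    target: str


def length_features(entry: TMEntry) -> dict:
    """Return illustrative length-based features for one entry.

    These are assumed features, not the ones used in the article.
    """
    src_chars = len(entry.source)
    tgt_chars = len(entry.target)
    src_words = len(entry.source.split())
    tgt_words = len(entry.target.split())
    longest = max(src_chars, tgt_chars)
    return {
        "src_chars": src_chars,
        "tgt_chars": tgt_chars,
        # Shorter length divided by longer length, in [0, 1];
        # defined as 1.0 when both segments are empty.
        "char_ratio": min(src_chars, tgt_chars) / longest if longest else 1.0,
        # Absolute difference in whitespace-delimited word counts.
        "word_diff": abs(src_words - tgt_words),
    }


# Example entry (made up for illustration).
features = length_features(TMEntry("Save the file", "Guarda el archivo"))
```

A dictionary of named features like this could be turned into a numeric vector and combined with the other feature groups mentioned in the abstract; an unusually low `char_ratio` would be one possible signal that a segment and its translation do not belong together.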

Notes

For full details, see Barbu et al. (2016).
Personal correspondence with Barbu.
http://docs.translatehouse.org/projects/translate-toolkit/.
https://github.com/hlt-mt/TMOP.
http://docs.translatehouse.org/projects/translate-toolkit/en/latest/commands/pofilter_tests.html.
We quote the test names as they are used in the documentation of pofilter.
https://hunspell.github.io/.
http://www.abisource.com/projects/enchant/.
https://languagetool.org/.
From https://www.languagetool.org/languages/.
http://kheafield.com/code/kenlm/.
https://github.com/clab/fast_align.
This is the only classifier for which setting the random seed cannot make the runs fully reproducible. See http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.

References

Barbu E (2015) Spotting false translation segments in translation memories. In: Proceedings of the Workshop Natural Language Processing for Translation Memories, Association for Computational Linguistics, Hissar, Bulgaria, pp 9–16, http://www.aclweb.org/anthology/W15-5202
Barbu E, Parra Escartín C, Bentivogli L, Negri M, Turchi M, Federico M, Mastrostefano L, Orăsan C (2016) 1st shared task on automatic translation memory cleaning preparation and lessons learned. In: 2nd Workshop on Natural Language Processing for Translation Memories (NLP4TM 2016), Portorož, Slovenia, LREC 2016, http://www.lrec-conf.org/proceedings/lrec2016/workshops/LREC2016Workshop-NLP4TM_Proceedings.pdf
Dougherty G (2013) Pattern Recognition and Classification: An Introduction. Springer. doi:10.1007/978-1-4614-5323-9
Dyer C, Chahuneau V, Smith NA (2013) A simple, fast, and effective reparameterization of IBM model 2. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Atlanta, Georgia, pp 644–648, http://www.aclweb.org/anthology/N13-1073
Gale WA, Church KW (1993) A program for aligning sentences in bilingual corpora. Computational Linguistics 19(1):75–102, http://dl.acm.org/citation.cfm?id=972450.972455
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. Journal of Machine Learning Research 3:1157–1182, http://www.jmlr.org/papers/v3/guyon03a.html
Heafield K, Pouzyrevsky I, Clark JH, Koehn P (2013) Scalable modified Kneser-Ney language model estimation. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, pp 690–696, http://kheafield.com/professional/edinburgh/estimate_paper.pdf
Koehn P (2005) Europarl: A parallel corpus for statistical machine translation. In: Conference Proceedings: the tenth Machine Translation Summit, AAMT, Phuket, Thailand, pp 79–86, http://mt-archive.info/MTS-2005-Koehn.pdf
Lagoudaki E (2006) Translation memories survey 2006: Users’ perceptions around TM use. In: Proceedings of the ASLIB International Conference Translating & the Computer, vol 28
Miłkowski M (2010) Developing an open-source, rule-based proofreading tool. Software: Practice and Experience 40(7):543–566, doi:10.1002/spe.971
O’Brien S (2007) Eye-tracking and translation memory matches. Perspectives: Studies in translatology 14(3):185–205
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830
Specia L, Paetzold G, Scarton C (2015) Multi-level translation quality prediction with QuEst++. In: Proceedings of ACL-IJCNLP 2015 System Demonstrations, Beijing, China, pp 115–120, http://www.aclweb.org/anthology/P15-4020
Tiedemann J (2011) Bitext alignment. Synth Lect Hum Lang Technol. doi:10.2200/S00367ED1V01Y201106HLT014
Zariņa I, Ņikiforovs P, Skadiņš R (2015) Word alignment based parallel corpora evaluation and cleaning using machine learning techniques. In: El-Kahlout ID, Özkan M, Sánchez-Martínez F, Ramírez-Sánchez G, Hollowood F, Way A (eds) Proceedings of the 18th Annual Conference of the European Association for Machine Translation, Antalya, Turkey, pp 185–192, http://aclweb.org/anthology/W15-4924

Acknowledgements

This research was supported by the Academy of African Languages and Science Strategic Project of the University of South Africa. The author thanks the anonymous reviewers for valuable feedback.

Author information

Authors and Affiliations

AALS, College of Graduate Studies, University of South Africa, Pretoria, South Africa
Friedel Wolff

Corresponding author

Correspondence to Friedel Wolff.

About this article

Cite this article

Wolff, F. Combining off-the-shelf components to clean a translation memory. Machine Translation 30, 167–181 (2016). https://doi.org/10.1007/s10590-016-9186-7

Received: 05 June 2016
Accepted: 20 December 2016
Published: 02 February 2017
Issue Date: December 2016
DOI: https://doi.org/10.1007/s10590-016-9186-7
