Human versus automatic quality evaluation of NMT and PBSMT

Published in: Machine Translation 32, 217–235 (2018)

Abstract

Neural machine translation (NMT) has recently gained substantial popularity not only in academia but also in industry. For its acceptance in industry, it is important to investigate how NMT performs in comparison to phrase-based statistical MT (PBSMT), which until recently was the dominant MT paradigm. In the present work, we compare the quality of the PBSMT and NMT solutions of KantanMT, a commercial platform for custom MT, which are tailored to accommodate large-scale translation production, where there is a limited amount of time to train an end-to-end system (NMT or PBSMT). To satisfy the time requirements of our production line, we restrict NMT training time to 4 days; training a PBSMT system typically requires no more than one day with the current KantanMT training pipeline. To train the NMT and PBSMT engines for each language pair, we use strictly the same parallel corpora and the same pre- and post-processing steps (where applicable). Our results show that, even with time-restricted training of 4 days, NMT quality substantially surpasses that of PBSMT. Furthermore, we challenge the reliability of automatic quality evaluation metrics based on n-gram comparison (in particular F-measure, BLEU and TER) for NMT quality evaluation, supporting this hypothesis with both analytical and empirical evidence, and we investigate how suitable these metrics are for comparing the two paradigms.
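
The automatic metrics questioned above can be computed over a test set with standard tooling. The sketch below uses the sacrebleu library as an illustrative stand-in (the scoring script actually referenced in this paper is Moses' multi-bleu.perl; see note 15), and adds chrF (note 13) for comparison; the hypothesis and reference strings are invented:

```python
from sacrebleu.metrics import BLEU, CHRF, TER

# Hypothetical MT outputs and one stream of reference translations.
hypotheses = [
    "the cat sat on the mat",
    "he read the report yesterday evening",
]
references = [[
    "the cat sat on the mat",
    "he read the report last night",
]]

# BLEU and chrF are reported on a 0-100 scale (cf. note 9); TER is an
# error rate, so lower is better.
for metric in (BLEU(), CHRF(), TER()):
    print(metric.corpus_score(hypotheses, references))
```

Note how the second hypothesis, an adequate paraphrase of its reference, is still penalised because its n-grams do not match the reference exactly; this is the kind of behaviour that makes n-gram-based metrics a questionable yardstick for NMT output.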

Notes

  1. https://translate.google.com/.

  2. https://www.bing.com/translator/.

  3. https://systransoft.com.

  4. https://kantanmt.com/.

  5. F-measure, BLEU and TER are algorithms for quality evaluation of MT systems, typically used to estimate fluency, adequacy and the extent of translation errors (cf. Way 2018b for more details); their standard definitions are sketched after these notes.

  6. We consider tokenization and cleaning as general pre-processing steps; word segmentation (e.g. Byte Pair Encoding (BPE); Sennrich et al. 2016) is an NMT-specific pre-processing step (see the sketch after these notes).

  7. http://workshop2015.iwslt.org/.

  8. http://www.casmacat.eu/corpus/news-commentary.html.

  9. BLEU scores are presented in the range of 0–100.

  10. In http://kv-emptypages.blogspot.ie/2016/09/the-google-neural-machine-translation.html the author argues against the generalizability of the results and the appropriateness of the evaluations performed.

  11. http://tramooc.eu.

  12. Few translations will attain a score of 1 unless they are identical to a reference translation. Even translations by professional translators will not necessarily obtain a BLEU score of 1.

  13. Character-based F-measure was also shown to correlate well with human judgment (Popović 2015).

  14. See e.g. https://deeplearning4j.org/word2vec.html; a minimal word-embedding sketch is given after these notes.

  15. https://github.com/moses-smt/mosesdecoder/scripts/generic/multi-bleu.perl.

  16. An MT engine refers to the package of models (translation, language and recasing models for PBSMT, and an encoder–decoder model for NMT) as well as to the required rules and dictionaries for pre- and post-processing; an illustrative structure is given after these notes.

  17. KantanMT provides both cloud-based and on-premise solutions.

  18. Typically, additional monolingual data are employed to train the language model of a PBSMT engine. While we acknowledge that monolingual data, among other optimisations, may improve a PBSMT engine, within the scope of this work we keep the same-data assumption. That is, we do not employ any data other than the parallel corpus provided. This requirement is vital when trying to answer the question: “When a user has at their disposal a set of parallel data, which of the two paradigms is preferable to train: PBSMT or NMT?”.

  19. Expressed in terms of increases in BLEU and F-measure and decreases in TER, as well as according to internal human evaluation.

  20. By Chinese, we mean Simplified Mandarin Chinese.

  21. https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory.

  22. http://opus.lingfil.uu.se/.

  23. Training, testing and tuning data was normalised prior to building the MT engines.
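
The three metrics in note 5 have the following standard definitions (Melamed et al. 2003; Papineni et al. 2002; Snover et al. 2006). This is a reference sketch with the usual choices: uniform weights w_n, N = 4 for BLEU, and the balanced F1 variant of F-measure:

```latex
% F-measure: harmonic mean of precision P and recall R of the candidate
% translation against the reference (balanced F1 variant).
F_1 = \frac{2PR}{P + R}

% BLEU: brevity penalty times the geometric mean of the modified n-gram
% precisions p_n; c is the candidate length, r the reference length.
\mathrm{BLEU} = \min\!\left(1,\, e^{\,1 - r/c}\right) \cdot
                \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right)

% TER: edits (insertions, deletions, substitutions and shifts) needed to
% turn the candidate into the reference, normalised by the average number
% of reference words; lower is better.
\mathrm{TER} = \frac{\#\ \mathrm{edits}}{\mathrm{avg.}\ \#\ \mathrm{reference\ words}}
```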
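
A minimal sketch of the NMT-specific segmentation step from note 6, using the subword-nmt package released with Sennrich et al. (2016). The file names and the number of merge operations are illustrative assumptions, not the settings used in this work:

```python
import codecs

# subword-nmt is the reference BPE implementation of Sennrich et al. (2016).
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn BPE merge operations on a tokenized training corpus
# (file names and the 30k merge count are hypothetical).
with codecs.open("train.tok.src", encoding="utf-8") as train, \
     codecs.open("bpe.codes", "w", encoding="utf-8") as codes:
    learn_bpe(train, codes, 30000)

# Apply the learned codes: rare words are split into subword units,
# joined by the default "@@" separator.
with codecs.open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("the thermostat malfunctioned"))
# Possible output: "the therm@@ o@@ stat malfunction@@ ed"
# (the actual segmentation depends on the training corpus).
```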
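
The link in note 14 points to a word2vec tutorial; as a minimal illustration of the underlying idea (words mapped to dense vectors so that similar contexts yield nearby vectors), here is a sketch using gensim's Word2Vec, assuming the gensim 4.x API; the toy corpus and parameters are invented:

```python
from gensim.models import Word2Vec

# Toy corpus of pre-tokenized sentences (illustrative only).
sentences = [
    ["the", "translation", "was", "fluent"],
    ["the", "translation", "was", "adequate"],
    ["the", "output", "was", "fluent"],
]

# Train a small embedding model (gensim 4.x parameter names).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["translation"].shape)              # (50,)
print(model.wv.similarity("fluent", "adequate"))  # cosine similarity in [-1, 1]
```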
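
Note 16's notion of an "engine" can be pictured as a package of artefacts. The structure below is purely hypothetical (the field names are not KantanMT's actual schema); it only mirrors the components the note enumerates:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional

@dataclass
class MTEngine:
    """Hypothetical bundle of the artefacts note 16 calls an 'MT engine'."""
    # PBSMT components (None for an NMT engine).
    translation_model: Optional[Path] = None
    language_model: Optional[Path] = None
    recasing_model: Optional[Path] = None
    # NMT component (None for a PBSMT engine).
    encoder_decoder_model: Optional[Path] = None
    # Shared pre-/post-processing resources.
    preprocessing_rules: List[Path] = field(default_factory=list)
    postprocessing_rules: List[Path] = field(default_factory=list)
    dictionaries: List[Path] = field(default_factory=list)
```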

References

  • Agarwal A, Lavie A (2008) METEOR, M-BLEU and M-TER: evaluation metrics for high-correlation with human rankings of machine translation output. In: Proceedings of the third workshop on statistical machine translation, Columbus, Ohio, pp 115–118

  • Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: Proceedings of the 6th international conference on learning representations (ICLR 2015), San Diego, CA, USA

  • Bentivogli L, Bisazza A, Cettolo M, Federico M (2016) Neural versus phrase-based machine translation quality: a case study. In: Proceedings of the 2016 conference on empirical methods in natural language processing, Austin, Texas, pp 257–267

  • Callison-Burch C, Osborne M, Koehn P (2006) Re-evaluating the role of BLEU in machine translation research. In: Proceedings of the eleventh conference of the European chapter of the association for computational linguistics, Trento, Italy, pp 249–256

  • Castilho S, Moorkens J, Gaspari F, Calixto I, Tinsley J, Way A (2017) Is neural machine translation the new state of the art? Prague Bull Math Linguist 108(1):109–120

  • Cer D, Manning CD, Jurafsky D (2010) The best lexical metric for phrase-based statistical MT system optimization. In: Human language technologies: the 2010 annual conference of the North American chapter of the association for computational linguistics, Los Angeles, California, pp 555–563

  • Cettolo M, Niehues J, Stüker S, Bentivogli L, Cattoni R, Federico M (2015) The IWSLT 2015 evaluation campaign. In: Proceedings of the 12th international workshop on spoken language translation, Da Nang, Vietnam, pp 2–14

  • Chen B, Cherry C (2014) A systematic comparison of smoothing techniques for sentence-level BLEU. In: Proceedings of the ninth workshop on statistical machine translation (WMT@ACL 2014), Baltimore, Maryland, USA, pp 362–367

  • Chiang D (2005) A hierarchical phrase-based model for statistical machine translation. In: Proceedings of the 43rd annual meeting of the association for computational linguistics (ACL’05), Ann Arbor, Michigan, pp 263–270

  • Chiang D, DeNeefe S, Chan YS, Ng HT (2008) Decomposability of translation metrics for improved evaluation and efficient algorithms. In: Proceedings of the conference on empirical methods in natural language processing, Honolulu, Hawaii, USA, pp 610–619

  • Cho K, van Merriënboer B, Gülçehre Ç, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing, Doha, Qatar, pp 1724–1734

  • Chung J, Cho K, Bengio Y (2016) A character-level decoder without explicit segmentation for neural machine translation. In: Proceedings of the 54th annual meeting of the association for computational linguistics, ACL 2016, vol 1, long papers, Berlin, Germany, pp 1693–1703

  • Costa-Jussà MR, Farrús M, Mariño JB, Fonollosa JAR (2012) Study and comparison of rule-based and statistical Catalan-Spanish machine translation systems. Comput Inform 31(2):245–270

  • Crego JM, Kim J, Klein G, Rebollo A, Yang K, Senellart J, Akhanov E, Brunelle P, Coquard A, Deng Y, Enoue S, Geiss C, Johanson J, Khalsa A, Khiari R, Ko B, Kobus C, Lorieux J, Martins L, Nguyen D, Priori A, Riccardi T, Segal N, Servan C, Tiquet C, Wang B, Yang J, Zhang D, Zhou J, Zoldan P (2016) Systran’s pure neural machine translation systems. CoRR arXiv:1610.05540

  • Daems J, Vandepitte S, Hartsuiker RJ, Macken L (2017) Identifying the machine translation error types with the greatest impact on post-editing effort. Front Psychol 8:1282

  • Dyer C, Chahuneau V, Smith NA (2013) A simple, fast, and effective reparameterization of IBM model 2. In: Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: human language technologies (NAACL), Atlanta, USA, pp 644–649

  • Farrús M, Costa-jussà MR, Popović M (2012) Study and correlation analysis of linguistic, perceptual, and automatic machine translation evaluations. J Assoc Inf Sci Technol 63(1):174–184

  • Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychol Bull 76(5):378–382

  • Ha TL, Niehues J, Eunah C, Mediani M, Waibel A (2015) The KIT translation systems for IWSLT 2015. In: Proceedings of the 12th international workshop on spoken language translation, Da Nang, Vietnam, pp 62–69

  • Junczys-Dowmunt M, Dwojak T, Hoang H (2016) Is neural machine translation ready for deployment? A case study on 30 translation directions. In: Proceedings of the 13th international workshop on spoken language translation, Seattle, WA

  • Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. CoRR arXiv:1412.6980

  • Klein G, Kim Y, Deng Y, Senellart J, Rush AM (2017) OpenNMT: open-source toolkit for neural machine translation. In: Proceedings of the 55th annual meeting of the association for computational linguistics, ACL 2017, System Demonstrations, Vancouver, Canada, pp 67–72

  • Klubička F, Toral A, Sánchez-Cartagena VM (2017) Fine-grained human evaluation of neural versus phrase-based machine translation. Prague Bull Math Linguist 108(1):121–132

  • Koehn P (2010) Statistical machine translation, 1st edn. Cambridge University Press, New York, NY

  • Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the demo and poster sessions, Prague, Czech Republic, pp 177–180

  • Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33(1):159–174

  • Luong MT, Manning CD (2015) Stanford neural machine translation systems for spoken language domains. In: Proceedings of the 12th international workshop on spoken language translation (IWSLT), Da Nang, Vietnam, pp 76–79

  • Melamed ID, Green R, Turian JP (2003) Precision and recall of machine translation. In: Proceedings of the human language technology conference of the North American chapter of the association for computational linguistics (HLT-NAACL 2003), Edmonton, Canada, pp 61–63

  • Och F, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51

  • Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics, Philadelphia, Pennsylvania, USA, pp 311–318

  • Popović M (2015) chrF: character n-gram f-score for automatic MT evaluation. In: Proceedings of the tenth workshop on statistical machine translation (WMT@EMNLP 2015), Lisbon, Portugal, pp 392–395

  • Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics, ACL 2016, vol 1, Long Papers, Berlin, Germany, pp 1715–1725

  • Shterionov D, Du J, Palminteri MA, Casanellas L, O’Dowd T, Way A (2016) Improving KantanMT training efficiency with FastAlign. In: Proceedings of AMTA 2016, the twelfth conference of the Association for Machine Translation in the Americas, vol 2, MT Users’ Track, Austin, TX, USA, pp 222–231

  • Shterionov D, Nagle P, Casanellas L, Superbo R, O'Dowd T (2017) Empirical evaluation of NMT and PBSMT quality for large-scale translation production. In: Proceedings of the user track of the 20th annual conference of the European Association for Machine Translation (EAMT), Prague, Czech Republic, pp 74–79

  • Smith A, Hardmeier C, Tiedemann J (2016) Climbing Mont BLEU: the strange world of reachable high-BLEU translations. In: Proceedings of the 19th annual conference of the European Association for Machine Translation, EAMT 2016, Riga, Latvia, pp 269–281

  • Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: AMTA 2006. Proceedings of the 7th conference of the association for machine translation of the Americas. Visions for the future of machine translation, Cambridge, Massachusetts, USA, pp 223–231

  • Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Proceedings of advances in neural information processing systems 27: annual conference on neural information processing systems, Montreal, Quebec, Canada, pp 3104–3112

  • Vanmassenhove E, Du J, Way A (2016) Improving subject-verb agreement in SMT. In: Proceedings of the fifth workshop on hybrid approaches to translation, Riga, Latvia

  • Way A (2018a) Machine translation: where are we at today? In: Angelone E, Massey G, Ehrensberger-Dow M (eds) The Bloomsbury Companion to language industry studies. Bloomsbury, London

  • Way A (2018b) Quality expectations of machine translation. In: Moorkens J, Castilho S, Gaspari F, Doherty S (eds) Translation quality assessment: from principles to practice. Springer, Berlin

  • Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, Klingner J, Shah A, Johnson M, Liu X, Kaiser L, Gouws S, Kato Y, Kudo T, Kazawa H, Stevens K, Kurian G, Patil N, Wang W, Young C, Smith J, Riesa J, Rudnick A, Vinyals O, Corrado G, Hughes M, Dean J (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR arXiv:1609.08144

  • Ziemski M, Junczys-Dowmunt M, Pouliquen B (2016) The United Nations Parallel Corpus v1.0. In: Proceedings of the tenth international conference on language resources and evaluation, Portorož, Slovenia, pp 3530–3534

Acknowledgements

We would like to thank our external evaluators: Xiyi Fan, Ruopu Wang, Wan Nie, Ayumi Tanaka, Maki Iwamoto, Risako Hayakawa, Silvia Doehner, Daniela Naumann, Moritz Philipp, Annabella Ferola, Anna Ricciardelli, Paola Gentile, Celia Ruiz Arca, Clara Beltrá, Giulia Mattoni, as well as University College London, Dublin City University, KU Leuven, University of Strasbourg, and University of Stuttgart.

Author information

Corresponding author

Correspondence to Dimitar Shterionov.

Cite this article

Shterionov, D., Superbo, R., Nagle, P. et al. Human versus automatic quality evaluation of NMT and PBSMT. Machine Translation 32, 217–235 (2018). https://doi.org/10.1007/s10590-018-9220-z
