Human versus automatic quality evaluation of NMT and PBSMT

Published in: Machine Translation 32, 217–235 (2018)

Abstract

Neural machine translation (NMT) has recently gained substantial popularity not only in academia but also in industry. For its acceptance in industry, it is important to investigate how NMT performs in comparison to phrase-based statistical MT (PBSMT), which until recently was the dominant MT paradigm. In the present work, we compare the quality of the PBSMT and NMT solutions of KantanMT, a commercial platform for custom MT, which are tailored to accommodate large-scale translation production, where there is a limited amount of time to train an end-to-end system (NMT or PBSMT). To satisfy the time requirements of our production line, we restrict NMT training time to 4 days; training a PBSMT system typically requires no more than one day with the current KantanMT training pipeline. To train the NMT and PBSMT engines for each language pair, we use strictly the same parallel corpora and the same pre- and post-processing steps (where applicable). Our results show that, even with time-restricted training of 4 days, NMT quality substantially surpasses that of PBSMT. Furthermore, we challenge the reliability of automatic quality evaluation metrics based on n-gram comparison (in particular F-measure, BLEU and TER) for NMT quality evaluation, supporting this hypothesis with both analytical and empirical evidence, and we investigate how suitable these metrics are for comparing the two paradigms.
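
The automatic metrics questioned above can be computed over a test set with standard tooling. The sketch below uses the sacrebleu library as an illustrative stand-in (the scoring script actually referenced in this paper is Moses' multi-bleu.perl; see note 15), and adds chrF (note 13) for comparison; the hypothesis and reference strings are invented:

```python
from sacrebleu.metrics import BLEU, CHRF, TER

# Hypothetical MT outputs and one stream of reference translations.
hypotheses = [
    "the cat sat on the mat",
    "he read the report yesterday evening",
]
references = [[
    "the cat sat on the mat",
    "he read the report last night",
]]

# BLEU and chrF are reported on a 0-100 scale (cf. note 9); TER is an
# error rate, so lower is better.
for metric in (BLEU(), CHRF(), TER()):
    print(metric.corpus_score(hypotheses, references))
```

Note how the second hypothesis, an adequate paraphrase of its reference, is still penalised because its n-grams do not match the reference exactly; this is the kind of behaviour that makes n-gram-based metrics a questionable yardstick for NMT output.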

Notes

  1. https://translate.google.com/.

  2. https://www.bing.com/translator/.

  3. https://systransoft.com.

  4. https://kantanmt.com/.

  5. F-measure, BLEU and TER are algorithms for quality evaluation of MT systems, typically used to estimate fluency, adequacy and the extent of translation errors (cf. Way 2018b for more details); their standard definitions are sketched after these notes.

  6. We consider tokenization and cleaning as general pre-processing steps; word segmentation (e.g. Byte Pair Encoding (BPE); Sennrich et al. 2016) is an NMT-specific pre-processing step (see the sketch after these notes).

  7. http://workshop2015.iwslt.org/.

  8. http://www.casmacat.eu/corpus/news-commentary.html.

  9. BLEU scores are presented in the range of 0–100.

  10. In http://kv-emptypages.blogspot.ie/2016/09/the-google-neural-machine-translation.html the author argues against the generalizability of the results and the appropriateness of the evaluations performed.

  11. http://tramooc.eu.

  12. Few translations will attain a score of 1 unless they are identical to a reference translation. Even translations by professional translators will not necessarily obtain a BLEU score of 1.

  13. Character-based F-measure was also shown to correlate well with human judgment (Popović 2015).

  14. See e.g. https://deeplearning4j.org/word2vec.html; a minimal word-embedding sketch is given after these notes.

  15. https://github.com/moses-smt/mosesdecoder/scripts/generic/multi-bleu.perl.

  16. An MT engine refers to the package of models (translation, language and recasing models for PBSMT, and an encoder–decoder model for NMT) as well as to the required rules and dictionaries for pre- and post-processing; an illustrative structure is given after these notes.

  17. KantanMT provides both cloud-based and on-premise solutions.

  18. Typically, additional monolingual data are employed to train the language model of a PBSMT engine. While we acknowledge that monolingual data, among other optimisations, may improve a PBSMT engine, within the scope of this work we keep the same-data assumption. That is, we do not employ any data other than the parallel corpus provided. This requirement is vital when trying to answer the question: “When a user has at their disposal a set of parallel data, which of the two paradigms is preferable to train: PBSMT or NMT?”.

  19. Expressed in terms of increases in BLEU and F-measure and decreases in TER, as well as according to internal human evaluation.

  20. By Chinese, we mean Simplified Mandarin Chinese.

  21. https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory.

  22. http://opus.lingfil.uu.se/.

  23. Training, testing and tuning data was normalised prior to building the MT engines.
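
The three metrics in note 5 have the following standard definitions (Melamed et al. 2003; Papineni et al. 2002; Snover et al. 2006). This is a reference sketch with the usual choices: uniform weights w_n, N = 4 for BLEU, and the balanced F1 variant of F-measure:

```latex
% F-measure: harmonic mean of precision P and recall R of the candidate
% translation against the reference (balanced F1 variant).
F_1 = \frac{2PR}{P + R}

% BLEU: brevity penalty times the geometric mean of the modified n-gram
% precisions p_n; c is the candidate length, r the reference length.
\mathrm{BLEU} = \min\!\left(1,\, e^{\,1 - r/c}\right) \cdot
                \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right)

% TER: edits (insertions, deletions, substitutions and shifts) needed to
% turn the candidate into the reference, normalised by the average number
% of reference words; lower is better.
\mathrm{TER} = \frac{\#\ \mathrm{edits}}{\mathrm{avg.}\ \#\ \mathrm{reference\ words}}
```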
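
A minimal sketch of the NMT-specific segmentation step from note 6, using the subword-nmt package released with Sennrich et al. (2016). The file names and the number of merge operations are illustrative assumptions, not the settings used in this work:

```python
import codecs

# subword-nmt is the reference BPE implementation of Sennrich et al. (2016).
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn BPE merge operations on a tokenized training corpus
# (file names and the 30k merge count are hypothetical).
with codecs.open("train.tok.src", encoding="utf-8") as train, \
     codecs.open("bpe.codes", "w", encoding="utf-8") as codes:
    learn_bpe(train, codes, 30000)

# Apply the learned codes: rare words are split into subword units,
# joined by the default "@@" separator.
with codecs.open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("the thermostat malfunctioned"))
# Possible output: "the therm@@ o@@ stat malfunction@@ ed"
# (the actual segmentation depends on the training corpus).
```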
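
The link in note 14 points to a word2vec tutorial; as a minimal illustration of the underlying idea (words mapped to dense vectors so that similar contexts yield nearby vectors), here is a sketch using gensim's Word2Vec, assuming the gensim 4.x API; the toy corpus and parameters are invented:

```python
from gensim.models import Word2Vec

# Toy corpus of pre-tokenized sentences (illustrative only).
sentences = [
    ["the", "translation", "was", "fluent"],
    ["the", "translation", "was", "adequate"],
    ["the", "output", "was", "fluent"],
]

# Train a small embedding model (gensim 4.x parameter names).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

print(model.wv["translation"].shape)              # (50,)
print(model.wv.similarity("fluent", "adequate"))  # cosine similarity in [-1, 1]
```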
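
Note 16's notion of an "engine" can be pictured as a package of artefacts. The structure below is purely hypothetical (the field names are not KantanMT's actual schema); it only mirrors the components the note enumerates:

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List, Optional

@dataclass
class MTEngine:
    """Hypothetical bundle of the artefacts note 16 calls an 'MT engine'."""
    # PBSMT components (None for an NMT engine).
    translation_model: Optional[Path] = None
    language_model: Optional[Path] = None
    recasing_model: Optional[Path] = None
    # NMT component (None for a PBSMT engine).
    encoder_decoder_model: Optional[Path] = None
    # Shared pre-/post-processing resources.
    preprocessing_rules: List[Path] = field(default_factory=list)
    postprocessing_rules: List[Path] = field(default_factory=list)
    dictionaries: List[Path] = field(default_factory=list)
```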

References

  • Agarwal A, Lavie A (2008) METEOR, M-BLEU and M-TER: evaluation metrics for high-correlation with human rankings of machine translation output. In: Proceedings of the third workshop on statistical machine translation, Columbus, Ohio, pp 115–118

  • Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: Proceedings of the 6th international conference on learning representations (ICLR 2015), San Diego, CA, USA

  • Bentivogli L, Bisazza A, Cettolo M, Federico M (2016) Neural versus phrase-based machine translation quality: a case study. In: Proceedings of the 2016 conference on empirical methods in natural language processing, Austin, Texas, pp 257–267

  • Callison-Burch C, Osborne M, Koehn P (2006) Re-evaluating the role of BLEU in machine translation research. In: Proceedings of the eleventh conference of the European chapter of the association for computational linguistics, Trento, Italy, pp 249–256

  • Castilho S, Moorkens J, Gaspari F, Calixto I, Tinsley J, Way A (2017) Is neural machine translation the new state of the art? Prague Bull Math Linguist 108(1):109–120

  • Cer D, Manning CD, Jurafsky D (2010) The best lexical metric for phrase-based statistical MT system optimization. In: Human language technologies: the 2010 annual conference of the North American chapter of the association for computational linguistics, Los Angeles, California, pp 555–563

  • Cettolo M, Niehues J, Stüker S, Bentivogli L, Cattoni R, Federico M (2015) The IWSLT 2015 evaluation campaign. In: Proceedings of the 12th international workshop on spoken language translation, Da Nang, Vietnam, pp 2–14

  • Chen B, Cherry C (2014) A systematic comparison of smoothing techniques for sentence-level BLEU. In: Proceedings of the ninth workshop on statistical machine translation (WMT@ACL 2014), Baltimore, Maryland, USA, pp 362–367

  • Chiang D (2005) A hierarchical phrase-based model for statistical machine translation. In: Proceedings of the 43rd annual meeting of the association for computational linguistics (ACL’05), Ann Arbor, Michigan, pp 263–270

  • Chiang D, DeNeefe S, Chan YS, Ng HT (2008) Decomposability of translation metrics for improved evaluation and efficient algorithms. In: Proceedings of the conference on empirical methods in natural language processing, Honolulu, Hawaii, USA, pp 610–619

  • Cho K, van Merriënboer B, Gülçehre Ç, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing, Doha, Qatar, pp 1724–1734

  • Chung J, Cho K, Bengio Y (2016) A character-level decoder without explicit segmentation for neural machine translation. In: Proceedings of the 54th annual meeting of the association for computational linguistics, ACL 2016, vol 1, long papers, Berlin, Germany, pp 1693–1703

  • Costa-Jussà MR, Farrús M, Mariño JB, Fonollosa JAR (2012) Study and comparison of rule-based and statistical Catalan-Spanish machine translation systems. Comput Inform 31(2):245–270

  • Crego JM, Kim J, Klein G, Rebollo A, Yang K, Senellart J, Akhanov E, Brunelle P, Coquard A, Deng Y, Enoue S, Geiss C, Johanson J, Khalsa A, Khiari R, Ko B, Kobus C, Lorieux J, Martins L, Nguyen D, Priori A, Riccardi T, Segal N, Servan C, Tiquet C, Wang B, Yang J, Zhang D, Zhou J, Zoldan P (2016) Systran’s pure neural machine translation systems. CoRR arXiv:1610.05540

  • Daems J, Vandepitte S, Hartsuiker RJ, Macken L (2017) Identifying the machine translation error types with the greatest impact on post-editing effort. Front Psychol 8:1282

  • Dyer C, Chahuneau V, Smith NA (2013) A simple, fast, and effective reparameterization of IBM model 2. In: Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: human language technologies (NAACL), Atlanta, USA, pp 644–649

  • Farrús M, Costa-jussà MR, Popović M (2012) Study and correlation analysis of linguistic, perceptual, and automatic machine translation evaluations. J Assoc Inf Sci Technol 63(1):174–184

  • Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychol Bull 76(5):378–382

  • Ha TL, Niehues J, Eunah C, Mediani M, Waibel A (2015) The KIT translation systems for IWSLT 2015. In: Proceedings of the 12th international workshop on spoken language translation, Da Nang, Vietnam, pp 62–69

  • Junczys-Dowmunt M, Dwojak T, Hoang H (2016) Is neural machine translation ready for deployment? A case study on 30 translation directions. In: Proceedings of the 13th international workshop on spoken language translation, Seattle, WA

  • Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. CoRR arXiv:1412.6980

  • Klein G, Kim Y, Deng Y, Senellart J, Rush AM (2017) OpenNMT: open-source toolkit for neural machine translation. In: Proceedings of the 55th annual meeting of the association for computational linguistics, ACL 2017, System Demonstrations, Vancouver, Canada, pp 67–72

  • Klubička F, Toral A, Sánchez-Cartagena VM (2017) Fine-grained human evaluation of neural versus phrase-based machine translation. Prague Bull Math Linguist 108(1):121–132

  • Koehn P (2010) Statistical machine translation, 1st edn. Cambridge University Press, New York, NY

  • Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the demo and poster sessions, Prague, Czech Republic, pp 177–180

  • Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33(1):159–174

  • Luong MT, Manning CD (2015) Stanford neural machine translation systems for spoken language domains. In: Proceedings of the 12th international workshop on spoken language translation (IWSLT), Da Nang, Vietnam, pp 76–79

  • Melamed ID, Green R, Turian JP (2003) Precision and recall of machine translation. In: Proceedings of the human language technology conference of the North American chapter of the association for computational linguistics (HLT-NAACL 2003), Edmonton, Canada, pp 61–63

  • Och F, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51

  • Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics, Philadelphia, Pennsylvania, USA, pp 311–318

  • Popović M (2015) chrF: character n-gram f-score for automatic MT evaluation. In: Proceedings of the tenth workshop on statistical machine translation (WMT@EMNLP 2015), Lisbon, Portugal, pp 392–395

  • Sennrich R, Haddow B, Birch A (2016) Neural machine translation of rare words with subword units. In: Proceedings of the 54th annual meeting of the association for computational linguistics, ACL 2016, vol 1, Long Papers, Berlin, Germany, pp 1715–1725

  • Shterionov D, Du J, Palminteri MA, Casanellas L, O’Dowd T, Way A (2016) Improving KantanMT training efficiency with FastAlign. In: Proceedings of AMTA 2016, the twelfth conference of the Association for Machine Translation in the Americas, vol 2, MT Users’ Track, Austin, TX, USA, pp 222–231

  • Shterionov D, Nagle P, Casanellas L, Superbo R, O'Dowd T (2017) Empirical evaluation of NMT and PBSMT quality for large-scale translation production. In: Proceedings of the user track of the 20th annual conference of the European Association for Machine Translation (EAMT), Prague, Czech Republic, pp 74–79

  • Smith A, Hardmeier C, Tiedemann J (2016) Climbing Mont BLEU: the strange world of reachable high-BLEU translations. In: Proceedings of the 19th annual conference of the European Association for Machine Translation, EAMT 2016, Riga, Latvia, pp 269–281

  • Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: AMTA 2006. Proceedings of the 7th conference of the association for machine translation of the Americas. Visions for the future of machine translation, Cambridge, Massachusetts, USA, pp 223–231

  • Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. In: Proceedings of advances in neural information processing systems 27: annual conference on neural information processing systems, Montreal, Quebec, Canada, pp 3104–3112

  • Vanmassenhove E, Du J, Way A (2016) Improving subject-verb agreement in SMT. In: Proceedings of the fifth workshop on hybrid approaches to translation, Riga, Latvia

  • Way A (2018a) Machine translation: where are we at today? In: Angelone E, Massey G, Ehrensberger-Dow M (eds) The Bloomsbury Companion to language industry studies. Bloomsbury, London

  • Way A (2018b) Quality expectations of machine translation. In: Moorkens J, Castilho S, Gaspari F, Doherty S (eds) Translation quality assessment: from principles to practice. Springer, Berlin

  • Wu Y, Schuster M, Chen Z, Le QV, Norouzi M, Macherey W, Krikun M, Cao Y, Gao Q, Macherey K, Klingner J, Shah A, Johnson M, Liu X, Kaiser L, Gouws S, Kato Y, Kudo T, Kazawa H, Stevens K, Kurian G, Patil N, Wang W, Young C, Smith J, Riesa J, Rudnick A, Vinyals O, Corrado G, Hughes M, Dean J (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR arXiv:1609.08144

  • Ziemski M, Junczys-Dowmunt M, Pouliquen B (2016) The United Nations Parallel Corpus v1.0. In: Proceedings of the tenth international conference on language resources and evaluation, Portorož, Slovenia, pp 3530–3534

Acknowledgements

We would like to thank our external evaluators: Xiyi Fan, Ruopu Wang, Wan Nie, Ayumi Tanaka, Maki Iwamoto, Risako Hayakawa, Silvia Doehner, Daniela Naumann, Moritz Philipp, Annabella Ferola, Anna Ricciardelli, Paola Gentile, Celia Ruiz Arca, Clara Beltrá, Giulia Mattoni, as well as University College London, Dublin City University, KU Leuven, University of Strasbourg, and University of Stuttgart.

Author information

Corresponding author

Correspondence to Dimitar Shterionov.

Cite this article

Shterionov, D., Superbo, R., Nagle, P. et al. Human versus automatic quality evaluation of NMT and PBSMT. Machine Translation 32, 217–235 (2018). https://doi.org/10.1007/s10590-018-9220-z
