
The Significance of Recall in Automatic Metrics for MT Evaluation

  • Conference paper
Machine Translation: From Real Users to Research (AMTA 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3265)


Abstract

Recent research has shown that a balanced harmonic mean (F1 measure) of unigram precision and recall outperforms the widely used BLEU and NIST metrics for Machine Translation evaluation in terms of correlation with human judgments of translation quality. We show that significantly better correlations can be achieved by placing more weight on recall than on precision. While this may seem unexpected, since BLEU and NIST focus on n-gram precision and disregard recall, our experiments show that correlation with human judgments is highest when almost all of the weight is assigned to recall. We also show that stemming is significantly beneficial not just to simpler unigram precision and recall based metrics, but also to BLEU and NIST.
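The recall-weighted harmonic mean discussed in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's exact metric: it assumes simple whitespace tokenization, clips unigram matches with a multiset intersection, and uses an illustrative weight `alpha` on recall (`alpha = 0.5` recovers the balanced F1; values near 1 weight almost everything on recall).

```python
from collections import Counter

def unigram_prf(candidate, reference, alpha=0.9):
    """Unigram precision, recall, and a recall-weighted harmonic mean.

    alpha is the weight placed on recall: alpha = 0.5 gives the balanced
    F1, and the mean approaches pure recall as alpha approaches 1.
    """
    cand = candidate.split()
    ref = reference.split()
    # Clipped unigram matches: each reference token can match at most once.
    matches = sum((Counter(cand) & Counter(ref)).values())
    precision = matches / len(cand) if cand else 0.0
    recall = matches / len(ref) if ref else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    # Weighted harmonic mean; note recall dominates the score as alpha -> 1.
    f = precision * recall / (alpha * precision + (1 - alpha) * recall)
    return precision, recall, f
```

For example, scoring the candidate "the cat sat" against the reference "the cat sat on the mat" gives precision 1.0 but recall 0.5, and the recall-heavy weighting pulls the combined score toward 0.5 rather than toward the 2/3 that the balanced F1 would report.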



References

  1. Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: BLEU: a Method for Automatic Evaluation of Machine Translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, pp. 311–318 (July 2002)

  2. Doddington, G.: Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In: Proceedings of the Second Conference on Human Language Technology (HLT 2002), San Diego, CA, pp. 128–132 (2002)


  3. Su, K.-Y., Wu, M.-W., Chang, J.-S.: A New Quantitative Quality Measure for Machine Translation Systems. In: Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING 1992), Nantes, France, pp. 433–439 (1992)

  4. Akiba, Y., Imamura, K., Sumita, E.: Using Multiple Edit Distances to Automatically Rank Machine Translation Output. In: Proceedings of MT Summit VIII, Santiago de Compostela, Spain, pp. 15–20 (2001)


  5. Niessen, S., Och, F.J., Leusch, G., Ney, H.: An Evaluation Tool for Machine Translation: Fast Evaluation for Machine Translation Research. In: Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, pp. 39–45 (2000)

  6. Leusch, G., Ueffing, N., Ney, H.: String-to-String Distance Measure with Applications to Machine Translation Evaluation. In: Proceedings of MT Summit IX, New Orleans, LA, September 2003, pp. 240–247 (2003)


  7. Melamed, I.D., Green, R., Turian, J.P.: Precision and Recall of Machine Translation. In: Proceedings of HLT-NAACL 2003, Short Papers, Edmonton, Canada, May 2003, pp. 61–63 (2003)

  8. Turian, J.P., Shen, L., Melamed, I.D.: Evaluation of Machine Translation and its Evaluation. In: Proceedings of MT Summit IX, New Orleans, LA, September 2003, pp. 386–393 (2003)

  9. Lin, C.-Y., Hovy, E.: Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In: Proceedings of HLT-NAACL 2003, Edmonton, Canada, May 2003, pp. 71–78 (2003)


  10. van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Butterworths, London (1979)

  11. Coughlin, D.: Correlating Automated and Human Assessments of Machine Translation Quality. In: Proceedings of MT Summit IX, New Orleans, LA, September 2003, pp. 63–70 (2003)


  12. Efron, B., Tibshirani, R.: Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science 1(1), 54–77 (1986)


  13. Doddington, G.: Automatic Evaluation of Language Translation using N-gram Co-occurrence Statistics. Presentation at the DARPA/TIDES 2003 MT Workshop. NIST, Gaithersburg, MD (July 2003)

  14. Pang, B., Knight, K., Marcu, D.: Syntax-based Alignment of Multiple Translations: Extracting Paraphrases and Generating New Sentences. In: Proceedings of HLT-NAACL 2003, Edmonton, Canada, May 2003, pp. 102–109 (2003)




Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lavie, A., Sagae, K., Jayaraman, S. (2004). The Significance of Recall in Automatic Metrics for MT Evaluation. In: Frederking, R.E., Taylor, K.B. (eds.) Machine Translation: From Real Users to Research. AMTA 2004. Lecture Notes in Computer Science, vol. 3265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30194-3_16


  • DOI: https://doi.org/10.1007/978-3-540-30194-3_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23300-8

  • Online ISBN: 978-3-540-30194-3

