
The Significance of Recall in Automatic Metrics for MT Evaluation

  • Conference paper
Machine Translation: From Real Users to Research (AMTA 2004)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3265)


Abstract

Recent research has shown that a balanced harmonic mean (F1 measure) of unigram precision and recall outperforms the widely used BLEU and NIST metrics for Machine Translation evaluation in terms of correlation with human judgments of translation quality. We show that significantly better correlations can be achieved by placing more weight on recall than on precision. While this may seem unexpected, since BLEU and NIST focus on n-gram precision and disregard recall, our experiments show that correlation with human judgments is highest when almost all of the weight is assigned to recall. We also show that stemming is significantly beneficial not just to simpler unigram precision and recall based metrics, but also to BLEU and NIST.
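The recall-weighted harmonic mean discussed in the abstract can be sketched in a few lines. The following is a minimal illustration, not the paper's exact metric: it assumes simple whitespace tokenization, clips unigram matches with a multiset intersection, and uses an illustrative weight `alpha` on recall (`alpha = 0.5` recovers the balanced F1; values near 1 weight almost everything on recall).

```python
from collections import Counter

def unigram_prf(candidate, reference, alpha=0.9):
    """Unigram precision, recall, and a recall-weighted harmonic mean.

    alpha is the weight placed on recall: alpha = 0.5 gives the balanced
    F1, and the mean approaches pure recall as alpha approaches 1.
    """
    cand = candidate.split()
    ref = reference.split()
    # Clipped unigram matches: each reference token can match at most once.
    matches = sum((Counter(cand) & Counter(ref)).values())
    precision = matches / len(cand) if cand else 0.0
    recall = matches / len(ref) if ref else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    # Weighted harmonic mean; note recall dominates the score as alpha -> 1.
    f = precision * recall / (alpha * precision + (1 - alpha) * recall)
    return precision, recall, f
```

For example, scoring the candidate "the cat sat" against the reference "the cat sat on the mat" gives precision 1.0 but recall 0.5, and the recall-heavy weighting pulls the combined score toward 0.5 rather than toward the 2/3 that the balanced F1 would report.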



References

  1. Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: BLEU: a Method for Automatic Evaluation of Machine Translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, pp. 311–318 (July 2002)

  2. Doddington, G.: Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In: Proceedings of the Second Conference on Human Language Technology (HLT 2002), San Diego, CA, pp. 128–132 (2002)


  3. Su, K.-Y., Wu, M.-W., Chang, J.-S.: A New Quantitative Quality Measure for Machine Translation Systems. In: Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING 1992), Nantes, France, pp. 433–439 (1992)

  4. Akiba, Y., Imamura, K., Sumita, E.: Using Multiple Edit Distances to Automatically Rank Machine Translation Output. In: Proceedings of MT Summit VIII, Santiago de Compostela, Spain, pp. 15–20 (2001)


  5. Niessen, S., Och, F.J., Leusch, G., Ney, H.: An Evaluation Tool for Machine Translation: Fast Evaluation for Machine Translation Research. In: Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, pp. 39–45 (2000)

  6. Leusch, G., Ueffing, N., Ney, H.: String-to-String Distance Measure with Applications to Machine Translation Evaluation. In: Proceedings of MT Summit IX, New Orleans, LA, September 2003, pp. 240–247 (2003)


  7. Melamed, I.D., Green, R., Turian, J.P.: Precision and Recall of Machine Translation. In: Proceedings of HLT-NAACL 2003, Short Papers, Edmonton, Canada, May 2003, pp. 61–63 (2003)

  8. Turian, J.P., Shen, L., Melamed, I.D.: Evaluation of Machine Translation and its Evaluation. In: Proceedings of MT Summit IX, New Orleans, LA, September 2003, pp. 386–393 (2003)

  9. Lin, C.-Y., Hovy, E.: Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In: Proceedings of HLT-NAACL 2003, Edmonton, Canada, May 2003, pp. 71–78 (2003)


  10. van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Butterworths, London (1979)

  11. Coughlin, D.: Correlating Automated and Human Assessments of Machine Translation Quality. In: Proceedings of MT Summit IX, New Orleans, LA, September 2003, pp. 63–70 (2003)


  12. Efron, B., Tibshirani, R.: Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science 1(1), 54–77 (1986)


  13. Doddington, G.: Automatic Evaluation of Language Translation using N-gram Co-occurrence Statistics. Presentation at the DARPA/TIDES 2003 MT Workshop. NIST, Gaithersburg, MD (July 2003)

  14. Pang, B., Knight, K., Marcu, D.: Syntax-based Alignment of Multiple Translations: Extracting Paraphrases and Generating New Sentences. In: Proceedings of HLT-NAACL 2003, Edmonton, Canada, May 2003, pp. 102–109 (2003)




Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lavie, A., Sagae, K., Jayaraman, S. (2004). The Significance of Recall in Automatic Metrics for MT Evaluation. In: Frederking, R.E., Taylor, K.B. (eds.) Machine Translation: From Real Users to Research. AMTA 2004. Lecture Notes in Computer Science, vol. 3265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30194-3_16


  • DOI: https://doi.org/10.1007/978-3-540-30194-3_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23300-8

  • Online ISBN: 978-3-540-30194-3

