Evaluating Semantic Evaluations: How RTE Measures Up

Bayer, Sam; Burger, John; Ferro, Lisa; Henderson, John; Hirschman, Lynette; Yeh, Alex

doi:10.1007/11736790_18

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3944)

Included in the following conference series:

Machine Learning Challenges Workshop

Abstract

In this paper, we discuss paradigms for evaluating open-domain semantic interpretation as they apply to the PASCAL Recognizing Textual Entailment (RTE) evaluation (Dagan et al. 2005). We focus on three aspects critical to a successful evaluation: creation of large quantities of reasonably good training data, analysis of inter-annotator agreement, and joint analysis of test item difficulty and test-taker proficiency (Rasch analysis). We found that although RTE does not correspond to a “real” or naturally occurring language processing task, it nonetheless provides clear and simple metrics, a tolerable cost of corpus development, good annotator reliability (with the potential to exploit the remaining variability), and the possibility of finding noisy but plentiful training material.
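
As a rough illustration of two of the analyses named in the abstract, the sketch below shows the standard one-parameter Rasch model, which ties the probability of a correct answer to the gap between test-taker proficiency and item difficulty, together with Cohen's kappa as one common chance-corrected measure of agreement between two annotators. The function names and the choice of kappa are assumptions made for illustration; the abstract does not say which agreement statistic or which Rasch estimation procedure the authors used.

```python
import math
from collections import Counter


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the one-parameter Rasch model.

    Both arguments are on the same logit scale (hypothetical values here).
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def cohen_kappa(labels_a, labels_b) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Under this model a system whose proficiency equals an item's difficulty answers that item correctly half the time, and a kappa near zero means two annotators agree no more often than their label frequencies alone would predict.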

References

Aberdeen, J., Condon, S., Doran, C., Harper, L., Oshika, B., Phillips, J.: Evaluation of speech-to-speech translation systems (2005) (unpublished manuscript)
Aberdeen, J., Hirschman, L., Walker, M.: Evaluation for DARPA Communicator spoken dialogue systems. In: Proceedings of the 2nd Conference on Language Resources and Evaluation (2000)
Bayer, S., Burger, J., Ferro, L., Henderson, J., Yeh, A.: MITRE’s submissions to the EU Pascal RTE challenge. In: PASCAL Proceedings of the First Challenge Workshop, Recognizing Textual Entailment, Southampton, U.K. (2005)
Bayer, S., Burger, J., Greiff, W., Wellner, B.: The MITRE logical form generation system. In: Proceedings of Senseval-3: The Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pp. 69–72 (2004)
Bond, T.G., Fox, C.M.: Applying the Rasch Model: Fundamental Measurement in the Human Sciences. University of Toledo Press (2001)
Bos, J., Markert, K.: Combining shallow and deep NLP methods for recognizing textual entailment. In: PASCAL Proceedings of the First Challenge Workshop, Recognizing Textual Entailment, Southampton, U.K. (2005)
Brachman, R.: (AA)AI: More than the sum of its parts. AAAI Presidential Address. In: Presented at AAAI 2005 (2005)
Brown, P.F., Della Pietra, S.A., Della Pietra, V.J., Mercer, R.L.: The mathematics of statistical machine translation. Computational Linguistics 19 (1993)
Burger, J., Ferro, L.: Generating an entailment corpus from news headlines. In: ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, Ann Arbor, MI (2005)
Dagan, I., Glickman, O., Magnini, B.: The PASCAL recognizing textual entailment challenge. In: PASCAL Proceedings of the First Challenge Workshop, Recognizing Textual Entailment, Southampton, U.K. (2005)
Damianos, L., Wohlever, S., Kozierok, R., Ponte, J.: MiTAP for real users, real data, real problems. In: Proceedings of the Conference on Human Factors of Computing Systems, Fort Lauderdale, FL (2003)
Deshmukh, N., Duncan, R., Ganapathiraju, A., Picone, J.: Benchmarking human performance for continuous speech recognition. In: Proceedings of the Fourth International Conference on Spoken Language Processing, Philadelphia, Pennsylvania, USA, pp. 2486–2489 (1996)
Dolan, B., Brockett, C., Quirk, C.: Microsoft Research paraphrase corpus (2005), http://research.microsoft.com/research/nlp/msr_paraphrase.htm
Graff, D.: English Gigaword (2003), http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05
Grishman, R., Sundheim, B.: Design of the MUC-6 evaluation. In: Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, MD. NIST. Morgan Kaufmann, San Francisco (1995)
Gusfield, D.: Algorithms on Strings, Trees and Sequences. Cambridge University Press, Cambridge (1997)
Henderson, J., Morgan, W.: Paris: an automated MT evaluation metric toolkit; and a survey of metric performance on the segment ranking task. Technical report, MITRE (2005) (to appear)
Hirschman, L.: The evolution of evaluation: Lessons from the message understanding conferences. Computer Speech and Language 12, 281–305 (1998)
Hirschman, L.: Language understanding evaluations: Lessons learned from MUC and ATIS. In: Proceedings of LREC 1998, Granada (1998)
Hirschman, L., Bates, M., Dahl, D., Fisher, W.M., Garofolo, J., Pallett, D.S., Hunicke-Smith, K., Price, P., Rudnicky, A., Tzoukermann, E.: Multisite data collection and evaluation in spoken language understanding. In: Proceedings of the DARPA Workshop on Human Language Technology, Princeton, NJ, pp. 19–24 (1993)
Hirschman, L., Light, M., Breck, E., Burger, J.D.: Deep Read: A reading comprehension system. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (1999)
Hirschman, L., Yeh, A., Blaschke, C., Valencia, A.: Overview of BioCreAtIvE: Critical assessment of information extraction for biology. BMC Bioinformatics 6(suppl. 1) (2005)
Joachims, T.: Learning to Classify Text Using Support Vector Machines. Kluwer, Dordrecht (2002)
Lange, R., Moran, J., Greiff, W., Ferro, L.: A probabilistic Rasch analysis of question answering evaluations. In: Proceedings of HLT-NAACL 2004, pp. 65–72 (2004)
Light, M., Mann, G.S., Riloff, E., Breck, E.: Analyses for elucidating current question answering technology. Natural Language Engineering 7, 325–342 (2001)
Morgan, A., Hirschman, L., Colosimo, M., Yeh, A., Colombe, J.: Gene name identification and normalization using a model organism database. Journal of Biomedical Informatics 37, 396–410 (2004)
Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics 29 (2003)
Papineni, K., Roukos, S., Ward, T., Henderson, J., Reeder, F.: Corpus-based comprehensive and diagnostic MT evaluation: Initial Arabic, Chinese, French, and Spanish results. In: Proceedings of the 2002 Conference on Human Language Technology, San Diego, CA, pp. 124–127 (2002)
Sundheim, B.: Overview of results of the MUC-6 evaluation. In: Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, MD. NIST. Morgan Kaufmann, San Francisco (1995)
Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. In: Proceedings of ICML 2000, 17th International Conference on Machine Learning (2000)
Walker, M., Aberdeen, J., Boland, J., Bratt, E., Garofolo, J., Hirschman, L., Le, A., Lee, S., Narayanan, S., Papineni, K., Pellom, B., Polifroni, J., Potamianos, A., Prabhu, P., Rudnicky, A., Sanders, G., Seneff, S., Stallard, D., Whittaker, S.: DARPA Communicator dialog travel planning systems: The June 2000 data collection. In: Proceedings of Eurospeech 2001, Aalborg, Denmark (2001)
Wellner, B., Ferro, L., Greiff, W., Hirschman, L.: Reading comprehension tests for computer-based understanding evaluation. Natural Language Engineering (2005) (to appear)

Author information

Authors and Affiliations

The MITRE Corporation, 202 Burlington Road, Bedford, MA, 02144, USA
Sam Bayer, John Burger, Lisa Ferro, John Henderson, Lynette Hirschman & Alex Yeh

Editor information

Editors and Affiliations

Max Planck Institute for Biological Cybernetics, Spemannstr. 38, Tübingen, Germany
Joaquin Quiñonero-Candela
Bar Ilan University, 52900, Ramat Gan, Israel
Ido Dagan
ITC-IRST, Trento, Italy
Bernardo Magnini
Université d’Evry-Val d’Essonne, IBISC CNRS FRE 2873 and GENPOLE, 523, Place des terrasses, 91000, Evry, France
Florence d’Alché-Buc

About this paper

Cite this paper

Bayer, S., Burger, J., Ferro, L., Henderson, J., Hirschman, L., Yeh, A. (2006). Evaluating Semantic Evaluations: How RTE Measures Up. In: Quiñonero-Candela, J., Dagan, I., Magnini, B., d’Alché-Buc, F. (eds) Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment. MLCW 2005. Lecture Notes in Computer Science, vol 3944. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11736790_18

DOI: https://doi.org/10.1007/11736790_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33427-9
Online ISBN: 978-3-540-33428-6
eBook Packages: Computer Science, Computer Science (R0)
