Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3944)

Abstract

In this paper, we discuss paradigms for evaluating open-domain semantic interpretation as they apply to the PASCAL Recognizing Textual Entailment (RTE) evaluation (Dagan et al. 2005). We focus on three aspects critical to a successful evaluation: creation of large quantities of reasonably good training data, analysis of inter-annotator agreement, and joint analysis of test item difficulty and test-taker proficiency (Rasch analysis). We found that although RTE does not correspond to a “real” or naturally occurring language processing task, it nonetheless provides clear and simple metrics, a tolerable cost of corpus development, good annotator reliability (with the potential to exploit the remaining variability), and the possibility of finding noisy but plentiful training material.
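As a rough illustration of two of the analyses named in the abstract, the sketch below shows the standard dichotomous Rasch model (probability of a correct answer as a function of test-taker proficiency and item difficulty) and Cohen's kappa as one common measure of inter-annotator agreement. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the abstract does not say which agreement statistic or Rasch variant the authors actually used.

```python
import math
from collections import Counter


def rasch_probability(proficiency: float, difficulty: float) -> float:
    """Dichotomous Rasch model: chance that a test-taker answers an item correctly.

    Both arguments are on the same logit scale (an assumption of the model).
    """
    return 1.0 / (1.0 + math.exp(-(proficiency - difficulty)))


def cohen_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators labelling the same items.

    Chosen here only as a familiar agreement statistic; the paper may use another.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, a system whose proficiency equals an item's difficulty has an even chance of getting that item right under this model, and two annotators whose entailment judgments agree only at chance level receive a kappa of zero.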

References

  • Aberdeen, J., Condon, S., Doran, C., Harper, L., Oshika, B., Phillips, J.: Evaluation of speech-to-speech translation systems (2005) (unpublished manuscript)

  • Aberdeen, J., Hirschman, L., Walker, M.: Evaluation for DARPA Communicator spoken dialogue systems. In: Proceedings of the 2nd Conference on Language Resources and Evaluation (2000)

  • Bayer, S., Burger, J., Ferro, L., Henderson, J., Yeh, A.: MITRE’s submissions to the EU Pascal RTE challenge. In: PASCAL Proceedings of the First Challenge Workshop, Recognizing Textual Entailment, Southampton, U.K. (2005)

  • Bayer, S., Burger, J., Greiff, W., Wellner, B.: The MITRE logical form generation system. In: Proceedings of Senseval-3: The Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pp. 69–72 (2004)

  • Bond, T.G., Fox, C.M.: Applying the Rasch Model: Fundamental Measurement in the Human Sciences. University of Toledo Press (2001)

  • Bos, J., Markert, K.: Combining shallow and deep NLP methods for recognizing textual entailment. In: PASCAL Proceedings of the First Challenge Workshop, Recognizing Textual Entailment, Southampton, U.K. (2005)

  • Brachman, R.: (AA)AI: More than the sum of its parts. AAAI Presidential Address. In: Presented at AAAI 2005 (2005)

  • Brown, P.F., Della Pietra, S.A., Della Pietra, V.J., Mercer, R.L.: The mathematics of statistical machine translation. Computational Linguistics 19 (1993)

  • Burger, J., Ferro, L.: Generating an entailment corpus from news headlines. In: ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, Ann Arbor, MI (2005)

  • Dagan, I., Glickman, O., Magnini, B.: The PASCAL recognizing textual entailment challenge. In: PASCAL Proceedings of the First Challenge Workshop, Recognizing Textual Entailment, Southampton, U.K. (2005)

  • Damianos, L., Wohlever, S., Kozierok, R., Ponte, J.: MiTAP for real users, real data, real problems. In: Proceedings of the Conference on Human Factors of Computing Systems, Fort Lauderdale, FL (2003)

  • Deshmukh, N., Duncan, R., Ganapathiraju, A., Picone, J.: Benchmarking human performance for continuous speech recognition. In: Proceedings of the Fourth International Conference on Spoken Language Processing, Philadelphia, Pennsylvania, USA, pp. 2486–2489 (1996)

  • Dolan, B., Brockett, C., Quirk, C.: Microsoft Research paraphrase corpus (2005), http://research.microsoft.com/research/nlp/msr_paraphrase.htm

  • Graff, D.: English Gigaword (2003), http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05

  • Grishman, R., Sundheim, B.: Design of the MUC-6 evaluation. In: Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, MD NIST. Morgan Kaufmann, San Francisco (1995)

  • Gusfield, D.: Algorithms on Strings, Trees and Sequences. Cambridge University Press, Cambridge (1997)

  • Henderson, J., Morgan, W.: Paris: an automated MT evaluation metric toolkit; and a survey of metric performance on the segment ranking task. Technical report, MITRE (2005) (to appear)

  • Hirschman, L.: The evolution of evaluation: Lessons from the message understanding conferences. Computer Speech and Language 12, 281–305 (1998)

  • Hirschman, L.: Language understanding evaluations: Lessons learned from MUC and ATIS. In: Proceedings of LREC 1998, Granada (1998)

  • Hirschman, L., Bates, M., Dahl, D., Fisher, W.M., Garofolo, J., Pallett, D.S., Hunicke-Smith, K., Price, P., Rudnicky, A., Tzoukermann, E.: Multisite data collection and evaluation in spoken language understanding. In: Proceedings of the DARPA Workshop on Human Language Technology, Princeton, NJ, pp. 19–24 (1993)

  • Hirschman, L., Light, M., Breck, E., Burger, J.D.: Deep Read: A reading comprehension system. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (1999)

  • Hirschman, L., Yeh, A., Blaschke, C., Valencia, A.: Overview of BioCreAtIvE: Critical assessment of information extraction for biology. BMC Bioinformatics 6(suppl. 1) (2005)

  • Joachims, T.: Learning to Classify Text Using Support Vector Machines. Kluwer, Dordrecht (2002)

  • Lange, R., Moran, J., Greiff, W., Ferro, L.: A probabilistic Rasch analysis of question answering evaluations. In: Proceedings of HLT-NAACL 2004, pp. 65–72 (2004)

  • Light, M., Mann, G.S., Riloff, E., Breck, E.: Analyses for elucidating current question answering technology. Natural Language Engineering 7, 325–342 (2001)

  • Morgan, A., Hirschman, L., Colosimo, M., Yeh, A., Colombe, J.: Gene name identification and normalization using a model organism database. Journal of Biomedical Informatics 37, 396–410 (2004)

  • Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics 29 (2003)

  • Papineni, K., Roukos, S., Ward, T., Henderson, J., Reeder, F.: Corpus-based comprehensive and diagnostic MT evaluation: Initial Arabic, Chinese, French, and Spanish results. In: Proceedings of the 2002 Conference on Human Language Technology, San Diego, CA, pp. 124–127 (2002)

  • Sundheim, B.: Overview of results of the MUC-6 evaluation. In: Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, MD. NIST. Morgan Kaufmann, San Francisco (1995)

  • Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. In: Proceedings of ICML 2000, 17th International Conference on Machine Learning (2000)

  • Walker, M., Aberdeen, J., Boland, J., Bratt, E., Garofolo, J., Hirschman, L., Le, A., Lee, S., Narayanan, S., Papineni, K., Pellom, B., Polifroni, J., Potamianos, A., Prabhu, P., Rudnicky, A., Sanders, G., Seneff, S., Stallard, D., Whittaker, S.: DARPA Communicator dialog travel planning systems: The June 2000 data collection. In: Proceedings of Eurospeech 2001, Aalborg, Denmark (2001)

  • Wellner, B., Ferro, L., Greiff, W., Hirschman, L.: Reading comprehension tests for computer-based understanding evaluation. Natural Language Engineering (2005) (to appear)

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bayer, S., Burger, J., Ferro, L., Henderson, J., Hirschman, L., Yeh, A. (2006). Evaluating Semantic Evaluations: How RTE Measures Up. In: Quiñonero-Candela, J., Dagan, I., Magnini, B., d’Alché-Buc, F. (eds) Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment. MLCW 2005. Lecture Notes in Computer Science, vol 3944. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11736790_18

  • DOI: https://doi.org/10.1007/11736790_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-33427-9

  • Online ISBN: 978-3-540-33428-6

  • eBook Packages: Computer Science, Computer Science (R0)
