How to Run an Evaluation Task

With a Primary Focus on Ad Hoc Information Retrieval

Information Retrieval Evaluation in a Changing World

Part of the book series: The Information Retrieval Series (INRE, volume 41)

Abstract

This chapter provides a general guideline for researchers who are planning to run a shared evaluation task for the first time, with a primary focus on simple ad hoc Information Retrieval (IR). That is, it is assumed that we have a static target document collection and a set of test topics (i.e., search requests), where participating systems are required to produce a ranked list of documents for each topic. The chapter provides a step-by-step description of what a task organiser team is expected to do. Section 1 discusses how to define the evaluation task; Sect. 2 how to publicise it and why it is important. Section 3 describes how to design and build test collections, as well as how inter-assessor agreement can be quantified. Section 4 explains how the results submitted by participants can be evaluated; examples of tools for computing evaluation measures and conducting statistical significance tests are provided. Finally, Sect. 5 discusses how the fruits of running the task should be shared with the research community, how progress should be monitored, and how we may be able to improve the task design for the next round. N.B.: A prerequisite to running a successful task is that you have a good team of organisers who can collaborate effectively. Each team member should be well-motivated and committed to running the task. They should respond to emails in a timely manner and should be able to meet deadlines. Organisers should be well-organised!
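
To make two of the steps mentioned above concrete, the following is a minimal Python sketch of the kind of computation covered in Sects. 3 and 4: Cohen's kappa for quantifying inter-assessor agreement on binary relevance labels, and a paired t-test for comparing the per-topic scores of two runs. It is not the chapter's own tooling; the label lists and per-topic scores are purely illustrative assumptions.

```python
from collections import Counter
from scipy import stats  # for the paired t-test


def cohen_kappa(labels_a, labels_b):
    """Cohen's (1960) kappa for two assessors labelling the same documents."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement, assuming each assessor labels independently
    # according to their own marginal label distribution.
    categories = set(labels_a) | set(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical binary relevance labels from two assessors (1 = relevant).
assessor_1 = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
assessor_2 = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print("kappa =", round(cohen_kappa(assessor_1, assessor_2), 3))

# Hypothetical per-topic scores (e.g., nDCG) of two runs over the same topics.
# ttest_rel performs a paired, two-sided t-test on the per-topic differences.
run_x = [0.42, 0.55, 0.31, 0.60, 0.48, 0.37, 0.52, 0.44]
run_y = [0.39, 0.58, 0.35, 0.57, 0.51, 0.41, 0.49, 0.47]
t_stat, p_value = stats.ttest_rel(run_x, run_y)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.3f}")
```

In practice, organisers would obtain the per-topic scores from an established evaluation tool (e.g., trec_eval) rather than hand-typed lists; the sketch only illustrates the shape of the analysis.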

Author information

Corresponding author: Tetsuya Sakai

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Sakai, T. (2019). How to Run an Evaluation Task. In: Ferro, N., Peters, C. (eds) Information Retrieval Evaluation in a Changing World. The Information Retrieval Series, vol 41. Springer, Cham. https://doi.org/10.1007/978-3-030-22948-1_3

  • DOI: https://doi.org/10.1007/978-3-030-22948-1_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-22947-4

  • Online ISBN: 978-3-030-22948-1
