Online Experimentation for Information Retrieval

  • Chapter
Information Retrieval (RuSSIR 2014)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 505)


Abstract

Online experimentation for information retrieval (IR) focuses on insights that can be gained from user interactions with IR systems, such as web search engines. The most common form of online experimentation, A/B testing, is widely used in practice, and has helped sustain continuous improvement of the current generation of these systems.
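To make this concrete, the following is a minimal sketch of how an A/B comparison might be analyzed, assuming click-through rate as the metric and a two-proportion z-test; both choices are illustrative here, not a method prescribed by the chapter.

```python
# Hypothetical A/B test analysis: users are randomly split between a control
# ranker (A) and a treatment ranker (B), and their click-through rates are
# compared with a two-proportion z-test. Metric and test are assumptions.
from math import sqrt
from statistics import NormalDist

def ab_test_ctr(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both variants have equal click-through rate."""
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)            # pooled CTR under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))  # standard error
    z = (clicks_b / n_b - clicks_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Example: 10,300 clicks on 100,000 queries (A) vs. 10,650 on 100,000 (B).
print(ab_test_ctr(10_300, 100_000, 10_650, 100_000))
```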

As online experimentation takes an increasingly central role in IR research and practice, new techniques are being developed to address questions such as the scale and fidelity of experiments in online settings. This paper gives an overview of the currently available tools, including techniques that are already in wide use, such as A/B testing and interleaved comparisons, as well as techniques that have been developed more recently, such as bandit approaches for online learning to rank.
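As an illustration of interleaved comparisons, below is a minimal sketch of team-draft interleaving, one widely used variant; the function names are illustrative and not taken from any specific implementation.

```python
# Sketch of team-draft interleaving: merge two rankings into one result list,
# remember which ranker contributed each document, and credit user clicks to
# the contributing ranker.
import random

def team_draft_interleave(ranking_a, ranking_b):
    """Merge two rankings; return the interleaved list and per-rank team labels."""
    pool = set(ranking_a) | set(ranking_b)
    interleaved, teams = [], []
    while len(interleaved) < len(pool):
        # A coin flip decides which ranker drafts first in this round.
        for team in random.sample(["A", "B"], 2):
            source = ranking_a if team == "A" else ranking_b
            doc = next((d for d in source if d not in interleaved), None)
            if doc is not None:
                interleaved.append(doc)
                teams.append(team)
    return interleaved, teams

def credit_clicks(teams, clicked_ranks):
    """Credit each click to the team whose document was clicked; the ranker
    with more credited clicks wins the comparison."""
    wins = {"A": 0, "B": 0}
    for rank in clicked_ranks:
        wins[teams[rank]] += 1
    return wins

# Example: interleave two rankings and credit a click on the top result.
interleaved, teams = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d3"])
print(credit_clicks(teams, clicked_ranks=[0]))
```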

This paper summarizes and connects the wide range of techniques and insights that have been developed in this field to date. It concludes with an outlook on open questions and directions for ongoing and future research.


Notes

  1. The terminology comes from the area of reinforcement learning, a type of machine learning in which an intelligent agent (e.g., an interactive IR system) learns from interactions with its environment (e.g., users) by trying out actions and observing rewards. This is a natural model for learning in interactive IR, and is discussed in more detail in Sect. 6. A minimal sketch of this trial-and-error loop is given after these notes.

  2. A policy defines a distribution over system actions, often conditioned on additional information, such as the history of previous interactions, or information about context, such as a query posed by the user. A sketch of such a context-conditioned distribution also follows below.
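The following sketch illustrates the trial-and-error loop described in Note 1 with an epsilon-greedy multi-armed bandit; treating each candidate ranker as an arm and a click-based signal as the reward is an assumption made here for illustration, not a method prescribed by the chapter.

```python
# Illustrative epsilon-greedy bandit: with probability epsilon the agent
# explores a random arm (e.g., a candidate ranker); otherwise it exploits
# the arm with the highest estimated reward (e.g., observed click rate).
import random

class EpsilonGreedyBandit:
    def __init__(self, n_arms: int, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms    # number of times each arm was tried
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select_arm(self) -> int:
        if random.random() < self.epsilon:               # explore
            return random.randrange(len(self.counts))
        return max(range(len(self.values)), key=lambda a: self.values[a])  # exploit

    def update(self, arm: int, reward: float) -> None:
        """Incorporate an observed reward into the arm's running mean."""
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

A policy in the sense of Note 2 can likewise be sketched as a distribution over system actions conditioned on context, here a softmax over per-action scores; how actions are scored for a given query is assumed, not specified by the chapter.

```python
# Illustrative softmax policy: map query-dependent scores for candidate
# actions (e.g., alternative result pages) to a probability distribution,
# then sample an action from it.
import math
import random

def softmax_policy(scores):
    """Turn per-action scores into a probability distribution."""
    m = max(scores)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sample_action(probs):
    """Sample an action index according to the policy's distribution."""
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# Example: three candidate actions scored for the current query.
probs = softmax_policy([1.2, 0.3, -0.5])
print(sample_action(probs), probs)
```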


Author information

Correspondence to Katja Hofmann.



Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Hofmann, K. (2015). Online Experimentation for Information Retrieval. In: Braslavski, P., Karpov, N., Worring, M., Volkovich, Y., Ignatov, D.I. (eds) Information Retrieval. RuSSIR 2014. Communications in Computer and Information Science, vol 505. Springer, Cham. https://doi.org/10.1007/978-3-319-25485-2_2


  • DOI: https://doi.org/10.1007/978-3-319-25485-2_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25484-5

  • Online ISBN: 978-3-319-25485-2

