Improving Shard Selection for Selective Search

  • Conference paper
Information Retrieval Technology (AIRS 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10648)

Abstract

The Selective Search approach processes large document collections efficiently by partitioning the collection into topically homogeneous groups (shards) and searching only the few shards that are estimated to contain documents relevant to the query. The ability to identify the relevant shards for a query directly impacts Selective Search performance. We therefore investigate three new approaches to the shard ranking problem, and three techniques for estimating how many of the top-ranked shards should be searched for a query (shard rank cutoff estimation). We learn a highly effective shard ranking model using the popular learning-to-rank framework. Another approach leverages the topical organization of the collection along with pseudo-relevance feedback (PRF) to further improve search performance. Empirical evaluation on a large collection demonstrates statistically significant improvements over strong baselines. Experiments also show that shard cutoff estimation is essential for balancing search precision and recall.
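
The two steps named in the abstract, shard ranking and shard rank cutoff estimation, can be illustrated with a minimal sketch. This is not the method evaluated in the paper: the per-shard scores are assumed to come from some ranker, and the function names, the `ratio` parameter, and the threshold rule (keep shards scoring at least a fraction of the top score) are all hypothetical choices made only for illustration.

```python
from typing import Dict, List, Tuple


def rank_shards(shard_scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order shards from most to least promising for one query.

    `shard_scores` is assumed to hold one relevance estimate per shard,
    produced by any shard ranker (hypothetical input).
    """
    # Sort by descending score; break ties by shard name for determinism.
    return sorted(shard_scores.items(), key=lambda kv: (-kv[1], kv[0]))


def estimate_cutoff(ranked: List[Tuple[str, float]], ratio: float = 0.5) -> int:
    """Illustrative cutoff rule: keep shards scoring >= ratio * top score."""
    if not ranked:
        return 0
    top_score = ranked[0][1]
    if top_score <= 0:
        # No positive evidence for any shard: fall back to the single best one.
        return 1
    # `ranked` is sorted in descending order, so the qualifying shards
    # form a prefix of the list and their count is the cutoff.
    return sum(1 for _, score in ranked if score >= ratio * top_score)


def select_shards(shard_scores: Dict[str, float], ratio: float = 0.5) -> List[str]:
    """Rank the shards, estimate the cutoff, and return the shards to search."""
    ranked = rank_shards(shard_scores)
    cutoff = estimate_cutoff(ranked, ratio)
    return [name for name, _ in ranked[:cutoff]]


if __name__ == "__main__":
    # Made-up scores for four topical shards.
    scores = {"sports": 0.9, "politics": 0.2, "health": 0.6, "travel": 0.1}
    print(select_shards(scores, ratio=0.5))
```

Under this rule, a lower `ratio` searches more shards, which tends to favor recall at higher cost, while a higher `ratio` searches fewer; this is the kind of trade-off that cutoff estimation is meant to control.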

Author information

Correspondence to Mon Shih Chuang or Anagha Kulkarni.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Chuang, M.S., Kulkarni, A. (2017). Improving Shard Selection for Selective Search. In: Sung, WK., et al. Information Retrieval Technology. AIRS 2017. Lecture Notes in Computer Science, vol 10648. Springer, Cham. https://doi.org/10.1007/978-3-319-70145-5_3

  • DOI: https://doi.org/10.1007/978-3-319-70145-5_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-70144-8

  • Online ISBN: 978-3-319-70145-5

  • eBook Packages: Computer Science (R0)
