Improving Shard Selection for Selective Search

Chuang, Mon Shih; Kulkarni, Anagha

doi:10.1007/978-3-319-70145-5_3

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10648)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

The Selective Search approach processes large document collections efficiently by partitioning the collection into topically homogeneous groups (shards) and searching only a few shards that are estimated to contain relevant documents for the query. The ability to identify the relevant shards for a query directly impacts Selective Search performance. We therefore investigate three new approaches to the shard ranking problem, and three techniques for estimating how many of the top-ranked shards should be searched for a query (shard rank cutoff estimation). We learn a highly effective shard ranking model using the popular learning-to-rank framework. Another approach leverages the topical organization of the collection along with pseudo relevance feedback (PRF) to improve the search performance further. Empirical evaluation using a large collection demonstrates statistically significant improvements over strong baselines. Experiments also show that shard cutoff estimation is essential to balance search precision and recall.
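The pipeline the abstract outlines (rank the shards, estimate a rank cutoff, search only the selected shards) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the term-frequency shard score, the `ratio` threshold used as a cutoff rule, and all function names are hypothetical stand-ins, not the learning-to-rank model, the PRF approach, or the cutoff estimators studied in the paper.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification)."""
    return text.lower().split()


def rank_shards(query, shard_stats):
    """Order shards by a toy score: summed relative frequency of the query terms.

    This is a stand-in for a real shard ranker, not the paper's model.
    """
    q_terms = tokenize(query)
    scored = []
    for shard_id, (term_counts, total) in shard_stats.items():
        score = sum(term_counts[t] / total for t in q_terms) if total else 0.0
        scored.append((shard_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def estimate_cutoff(ranked, ratio=0.5):
    """Hypothetical cutoff rule: keep shards scoring at least ratio * top score."""
    if not ranked or ranked[0][1] == 0:
        return 0
    top = ranked[0][1]
    return sum(1 for _, score in ranked if score >= ratio * top)


def selective_search(query, shards, ratio=0.5, k=3):
    """Search only the shards above the estimated rank cutoff."""
    # Per-shard term statistics, which a real system would precompute.
    shard_stats = {}
    for shard_id, docs in shards.items():
        counts = Counter(t for doc in docs for t in tokenize(doc))
        shard_stats[shard_id] = (counts, sum(counts.values()))

    ranked = rank_shards(query, shard_stats)
    cutoff = estimate_cutoff(ranked, ratio)
    selected = [shard_id for shard_id, _ in ranked[:cutoff]]

    # Score documents in the selected shards only (raw query-term counts).
    q_terms = tokenize(query)
    hits = []
    for shard_id in selected:
        for doc in shards[shard_id]:
            doc_terms = Counter(tokenize(doc))
            score = sum(doc_terms[t] for t in q_terms)
            if score > 0:
                hits.append((score, shard_id, doc))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return selected, hits[:k]


# Toy topical shards, invented for this example.
shards = {
    "sports": ["the team won the match", "a great match for the home team"],
    "cooking": ["bake the bread slowly", "season the soup well"],
    "finance": ["the market fell today", "bond yields rose"],
}
selected, hits = selective_search("team match", shards)
```

In this toy run only the shard containing the query terms passes the cutoff, so the other two are never scanned. A cutoff that is too low would miss relevant shards and one that is too high would search shards with no relevant documents, which is the precision and recall balance the abstract refers to.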

Author information

Authors and Affiliations

Department of Computer Science, San Francisco State University, 1600 Holloway Ave, San Francisco, CA, 94132, USA
Mon Shih Chuang & Anagha Kulkarni

Corresponding authors

Correspondence to Mon Shih Chuang or Anagha Kulkarni.

Editor information

Editors and Affiliations

Korea Institute of Science and Technology Information, Daejeon, Korea (Republic of)
Won-Kyung Sung
Korea Institute of Science and Technology Information, Daejeon, Korea (Republic of)
Hanmin Jung
Beijing University of Technology, Beijing, China
Shuo Xu
Burapha University, Chonburi, Thailand
Krisana Chinnasarn
Kwansei Gakuin University, Himeji, Hyogo, Japan
Kazutoshi Sumiya
Korea Institute of Science and Technology Information, Daejeon, Korea (Republic of)
Jeonghoon Lee
Renmin University of China, Beijing, China
Zhicheng Dou
Georgetown University, Washington, District of Columbia, USA
Grace Hui Yang
Konkuk University, Seoul, Korea (Republic of)
Young-Guk Ha
Korea Institute of Science and Technology Information, Daejeon, Korea (Republic of)
Seungbock Lee

About this paper

Cite this paper

Chuang, M.S., Kulkarni, A. (2017). Improving Shard Selection for Selective Search. In: Sung, WK., et al. Information Retrieval Technology. AIRS 2017. Lecture Notes in Computer Science, vol 10648. Springer, Cham. https://doi.org/10.1007/978-3-319-70145-5_3

DOI: https://doi.org/10.1007/978-3-319-70145-5_3
Published: 08 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-70144-8
Online ISBN: 978-3-319-70145-5
eBook Packages: Computer Science, Computer Science (R0)