Abstract
The inverted index is the preferred data structure for query processing in large information systems, and its compression techniques have long been studied to balance the trade-off between space occupancy and decompression time. During compression, partitioning a posting list into blocks aligned with its clustered distribution can effectively minimize the compressed size while keeping partitions separately accessible. Traditional partitioning strategies that use fixed-sized blocks tend to be easy to implement, but their compression effectiveness is vulnerable to outliers. Recently, researchers have begun to apply dynamic programming to determine optimal partitions with variable-sized blocks. However, these partitioning strategies sacrifice too much compression time. In this paper, we first compare the performance of existing encoders on the space-time trade-off curve. We then present a faster algorithm that heuristically computes optimal partitions for the state-of-the-art Partitioned Elias-Fano index, taking compression time into account while maintaining the same approximation guarantees. Experimental results on the TREC GOV2 document collection show that our method achieves a significant improvement over the original version.
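To make the contrast between fixed-sized and variable-sized blocks concrete, the following is a minimal sketch and not the algorithm proposed in the paper. It assumes a simplified Elias-Fano size estimate, a hypothetical fixed per-block overhead (`HEADER_BITS`), and an exact quadratic-time dynamic program. All function names are illustrative.

```python
import math

HEADER_BITS = 64  # hypothetical per-block overhead (assumption, not from the paper)

def ef_bits(n, universe):
    """Approximate Elias-Fano size in bits for n sorted integers from a range of size `universe`."""
    low = (universe // n).bit_length() - 1   # floor(log2(universe / n)) low bits per element
    return n * low + n + (universe >> low)   # low-bit array plus unary-coded high part

def block_cost(postings, i, j):
    """Estimated cost in bits of encoding postings[i:j] as one block."""
    universe = postings[j - 1] - postings[i] + 1
    return HEADER_BITS + ef_bits(j - i, universe)

def fixed_cost(postings, block_size):
    """Total estimated cost with fixed-sized blocks."""
    n = len(postings)
    return sum(block_cost(postings, i, min(i + block_size, n))
               for i in range(0, n, block_size))

def optimal_partition(postings):
    """Exact O(n^2) dynamic program over all block boundaries."""
    n = len(postings)
    best = [0] + [math.inf] * n   # best[j]: minimum cost of encoding postings[:j]
    prev = [0] * (n + 1)          # prev[j]: start index of the last block in that solution
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + block_cost(postings, i, j)
            if c < best[j]:
                best[j], prev[j] = c, i
    bounds = [n]
    while bounds[-1] > 0:         # walk back through the chosen block starts
        bounds.append(prev[bounds[-1]])
    return best[n], bounds[::-1]

# Two dense clusters separated by a few outliers (made-up data).
postings = list(range(0, 50)) + [1000, 5000, 20000] + list(range(20001, 20051))
total, bounds = optimal_partition(postings)
print("fixed (32 per block):", fixed_cost(postings, 32), "bits")
print("variable-sized:", total, "bits with boundaries", bounds)
```

Because any fixed-sized partition is one of the candidates the dynamic program considers, the variable-sized cost can never exceed the fixed-sized cost under the same cost model. The nested loop also shows why exact optimization is expensive at compression time, which is the cost the abstract says a faster heuristic aims to reduce.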
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Song, X., Jiang, K., Yang, Y. (2017). A Heuristically Optimized Partitioning Strategy on Elias-Fano Index. In: Xhafa, F., Barolli, L., Amato, F. (eds) Advances on P2P, Parallel, Grid, Cloud and Internet Computing. 3PGCIC 2016. Lecture Notes on Data Engineering and Communications Technologies, vol 1. Springer, Cham. https://doi.org/10.1007/978-3-319-49109-7_9
DOI: https://doi.org/10.1007/978-3-319-49109-7_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49108-0
Online ISBN: 978-3-319-49109-7
eBook Packages: Engineering, Engineering (R0)