A Heuristically Optimized Partitioning Strategy on Elias-Fano Index

  • Conference paper
Advances on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2016)

Abstract

The inverted index is the preferred data structure for query processing in large information systems, and its compression techniques have long been studied to balance space occupancy against decompression time. During compression, partitioning a posting list into blocks that align with its clustered distribution can effectively minimize the compressed size while keeping each partition separately accessible. Traditional partitioning strategies that use fixed-sized blocks tend to be easy to implement, but their compression effectiveness is vulnerable to outliers. Recently, researchers have begun to apply dynamic programming to determine optimal partitions with variable-sized blocks. However, these partitioning strategies sacrifice too much compression time. In this paper, we first compare the performance of existing encoders on the space-time trade-off curve. We then present a faster algorithm that heuristically computes optimal partitions for the state-of-the-art Partitioned Elias-Fano index, taking compression time into account while maintaining the same approximation guarantees. Experimental results on the TREC GOV2 document collection show that our method significantly improves on the original version.
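
To make the effect of block alignment concrete, the following is a minimal sketch and not the paper's algorithm. It estimates the Elias-Fano size of a posting list under different partitions, using the standard cost of roughly n * floor(log2(u/n)) low bits plus n + (u >> low) high bits per block. The per-block overhead constant, the function names, and the toy posting list are all assumptions made for illustration.

```python
import math

# Hypothetical fixed cost per block (e.g. stored base value and offsets).
BLOCK_OVERHEAD_BITS = 64


def ef_bits(n, universe):
    """Approximate Elias-Fano size in bits for n sorted integers in [0, universe)."""
    if n == 0:
        return 0
    # Number of low bits stored verbatim for each element.
    low = int(math.log2(universe / n)) if universe > n else 0
    # n * low bits for the low parts, plus the unary-coded high parts:
    # one set bit per element and one zero per high-part bucket.
    return n * low + n + (universe >> low)


def partition_bits(postings, boundaries):
    """Total size when postings are split at the given exclusive end indices."""
    total, start = 0, 0
    for end in boundaries:
        block = postings[start:end]
        # Each block is encoded relative to its own first element.
        universe = block[-1] - block[0] + 1
        total += ef_bits(len(block), universe) + BLOCK_OVERHEAD_BITS
        start = end
    return total


def fixed_boundaries(n, size):
    """End indices for fixed-sized blocks; the last block may be shorter."""
    return list(range(size, n, size)) + [n]


# Toy posting list: two dense clusters of document IDs separated by a large gap.
postings = list(range(0, 64)) + list(range(100000, 100064))

single = partition_bits(postings, [len(postings)])
fixed = partition_bits(postings, fixed_boundaries(len(postings), 48))
aligned = partition_bits(postings, [64, 128])
print(single, fixed, aligned)
```

In this toy case the fixed-sized split places one block across the gap between the two clusters, so that block pays for a large universe, while the cluster-aligned split keeps every block dense. Finding such boundaries automatically is the role of the partitioning strategy, whether by dynamic programming or by a heuristic.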

Author information

Corresponding author

Correspondence to Xingshen Song.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Song, X., Jiang, K., Yang, Y. (2017). A Heuristically Optimized Partitioning Strategy on Elias-Fano Index. In: Xhafa, F., Barolli, L., Amato, F. (eds) Advances on P2P, Parallel, Grid, Cloud and Internet Computing. 3PGCIC 2016. Lecture Notes on Data Engineering and Communications Technologies, vol 1. Springer, Cham. https://doi.org/10.1007/978-3-319-49109-7_9

  • DOI: https://doi.org/10.1007/978-3-319-49109-7_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-49108-0

  • Online ISBN: 978-3-319-49109-7

  • eBook Packages: Engineering, Engineering (R0)
