Inverted Index Compression

  • Reference work entry
Encyclopedia of Big Data Technologies

Definitions

The inverted index is the data structure at the core of today's large-scale search engines, social networks, and storage architectures. Given a collection of documents, consider for each distinct term t appearing in the collection the integer sequence ℓt, which lists in sorted order the identifiers of all the documents (docIDs in the following) in which the term appears. The sequence ℓt is called the inverted list or posting list of the term t. The inverted index is the collection of all such lists.
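
As a minimal sketch of this definition, the following Python builds the posting list of every term in a toy collection. The whitespace tokenizer, the function name `build_inverted_index`, and the choice of list positions as docIDs are assumptions made for illustration; they are not taken from the entry.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each distinct term to the sorted list of docIDs that contain it.

    Assumption: a document's docID is its position in `documents`, and
    terms are obtained by lowercasing and splitting on whitespace.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # set() ensures a docID is appended at most once per term.
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    # docIDs are visited in increasing order, so every list is already sorted.
    return dict(index)


docs = ["the quick brown fox", "the lazy dog", "quick dog"]
index = build_inverted_index(docs)
print(index["quick"])  # posting list of the term "quick"
```

Because documents are scanned in increasing docID order, each posting list comes out sorted without an explicit sort step, which matches the sorted-order requirement in the definition.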

This entry surveys the most important encoding algorithms developed for efficient inverted index compression and fast retrieval.

Overview

The inverted index owes its popularity to how efficiently it resolves queries, each expressed as a set of terms {t1, …, tk} combined with a query operator. The simplest operators are Boolean AND and OR. For example, given an AND query, the index has to report the docIDs of all the documents that contain every term in {t1, …, tk}....
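
The Boolean operators mentioned above can be sketched over uncompressed posting lists as follows. This is only an illustration under stated assumptions: the toy index, the function names, and the shortest-list-first ordering are hypothetical choices, and no compression is involved.

```python
def intersect(a, b):
    """Merge-style intersection of two sorted docID lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(index, terms):
    """docIDs of documents containing every term (Boolean AND)."""
    lists = [index.get(t, []) for t in terms]
    if not lists:
        return []
    # Heuristic (an assumption, not from the entry): start from the
    # shortest list so intermediate results stay small.
    lists.sort(key=len)
    result = lists[0]
    for lst in lists[1:]:
        result = intersect(result, lst)
    return result


def or_query(index, terms):
    """docIDs of documents containing at least one term (Boolean OR)."""
    return sorted(set().union(*(index.get(t, []) for t in terms)))


# Hypothetical toy index: term -> sorted posting list.
toy_index = {"quick": [0, 2, 5], "dog": [1, 2, 5, 7], "lazy": [1, 5]}
print(and_query(toy_index, ["quick", "dog"]))
print(or_query(toy_index, ["quick", "lazy"]))
```

The merge in `intersect` relies on both lists being sorted, which is why the definition of a posting list requires sorted docIDs.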

Author information

Correspondence to Giulio Ermanno Pibiri.

Copyright information

© 2019 Springer Nature Switzerland AG

Cite this entry

Pibiri, G.E., Venturini, R. (2019). Inverted Index Compression. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_52
