Inverted Index Compression

Pibiri, Giulio Ermanno; Venturini, Rossano

doi:10.1007/978-3-319-77525-8_52

Definitions

The data structure at the core of today's large-scale search engines, social networks, and storage architectures is the inverted index. Given a collection of documents, consider, for each distinct term t appearing in the collection, the integer sequence ℓ_t that lists in sorted order the identifiers (docIDs in the following) of all documents in which the term appears. The sequence ℓ_t is called the inverted list or posting list of the term t. The inverted index is the collection of all such lists.
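As a rough illustration of this definition, the following Python sketch builds an uncompressed inverted index from a toy collection. The function name build_inverted_index, the whitespace tokenization, the lowercasing, and the use of list positions as docIDs are all assumptions made for the example and are not taken from the entry.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each distinct term t to its posting list: the sorted docIDs
    of the documents in which t appears.

    Assumptions for this sketch: a docID is the position of the document
    in `docs`, and terms are lowercased whitespace-separated tokens.
    """
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            # A set keeps each docID at most once per term.
            postings[term].add(doc_id)
    # Sort every set so that each posting list is in increasing docID order.
    return {term: sorted(ids) for term, ids in postings.items()}


docs = ["the quick brown fox", "the lazy dog", "quick brown dogs"]
index = build_inverted_index(docs)
print(index["the"])    # posting list of the term "the"
print(index["quick"])  # posting list of the term "quick"
```

Each value of the dictionary plays the role of one sequence ℓ_t. Because the lists are sorted integer sequences, they are the natural input for the encoding algorithms this entry surveys.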

This entry surveys the most important encoding algorithms developed for efficient inverted index compression and fast retrieval.

Overview

The inverted index owes its popularity to the efficient resolution of queries, each expressed as a set of terms {t₁, …, t_k} combined with a query operator. The simplest operators are Boolean AND and OR. For example, given an AND query, the index has to report the docIDs of all documents that contain every term in {t₁, …, t_k}....
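To make the two Boolean operators concrete, here is a minimal Python sketch that resolves AND and OR queries over uncompressed posting lists. The helper names (intersect, and_query, or_query), the sample index, and the shortest-list-first ordering are illustrative assumptions. The entry's own query algorithms are not visible in this preview and may differ.

```python
def intersect(a, b):
    """Intersect two sorted docID lists with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(index, terms):
    """Report the docIDs of the documents containing every query term."""
    # Assumption: a term missing from the index has an empty posting list.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]  # start from the shortest list to keep work small
    for lst in lists[1:]:
        result = intersect(result, lst)
        if not result:
            break  # no document can match once the result is empty
    return result


def or_query(index, terms):
    """Report the docIDs of the documents containing at least one term."""
    return sorted(set().union(*(index.get(t, []) for t in terms)))


# Hypothetical posting lists, chosen only for illustration.
index = {
    "quick": [0, 2, 5, 9],
    "brown": [0, 2, 7, 9],
    "fox": [0, 9, 11],
}
print(and_query(index, ["quick", "brown", "fox"]))  # [0, 9]
print(or_query(index, ["brown", "fox"]))            # [0, 2, 7, 9, 11]
```

The merge-based intersection depends on the posting lists being sorted, which is exactly the property stated in the definition of ℓ_t.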


Author information

Authors and Affiliations

Department of Computer Science, University of Pisa, Pisa, Italy
Giulio Ermanno Pibiri & Rossano Venturini

Corresponding author

Correspondence to Giulio Ermanno Pibiri.

Editor information

Editors and Affiliations

Institute of Computer Science, University of Tartu, Tartu, Estonia
Sherif Sakr
School of Information Technologies, Sydney University, Sydney, Australia
Albert Y. Zomaya

Cite this entry

Pibiri, G.E., Venturini, R. (2019). Inverted Index Compression. In: Sakr, S., Zomaya, A.Y. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-77525-8_52

DOI: https://doi.org/10.1007/978-3-319-77525-8_52
Published: 20 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77524-1
Online ISBN: 978-3-319-77525-8
eBook Packages: Computer Science; Reference Module Computer Science and Engineering