Yet Another Sorting-Based Solution to the Reassignment of Document Identifiers

  • Conference paper
Information Retrieval Technology (AIRS 2012)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7675))

Abstract

Inverted files are widely used in search engines, such as Web search and library search. Previous work has demonstrated that the compressed size of an inverted file can be significantly reduced by reassigning document identifiers. There are two main state-of-the-art solutions: the URL sorting-based solution, which sorts the documents in alphabetical order of their URLs, and the TSP-based solution, which treats the reassignment as a Traveling Salesman Problem. These techniques achieve good compression but have significant limitations with respect to URLs and data size. In this paper, we propose an efficient solution to the reassignment problem that first sorts the terms in each document by document frequency and then sorts the documents by the presence of the terms. Our approach places few restrictions on data sets and is applicable to a variety of situations. Experimental results on four public data sets show that, compared with the TSP-based approach, our approach reduces the time complexity from \(O(n^2)\) to \(O(\overline{|D|} \cdot n\log n)\), where \(\overline{|D|}\) is the average length of the n documents, while achieving a comparable compression ratio; compared with the URL sorting-based approach, our approach improves the compression ratio by up to 10.6% with approximately the same run-time.
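
The two sorting steps named in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not specify the sort direction, the tie-breaking rule, or how "presence of the terms" is compared, so the sketch assumes that terms are ranked by descending document frequency (ties broken alphabetically) and that documents are compared lexicographically on their sorted term ranks. The function names `reassign_doc_ids` and `gamma_bits` and the toy collection `docs` are hypothetical. The second function estimates the index size as the number of bits needed to write every d-gap in Elias gamma code, one common way to measure the effect of a reordering.

```python
from collections import Counter

def reassign_doc_ids(docs):
    """Return `order`, where order[new_id] == old_id.

    Assumptions (not stated in the abstract): terms are ranked by
    descending document frequency, and documents are compared
    lexicographically on their sorted lists of term ranks.
    """
    # Document frequency: number of documents containing each term.
    df = Counter(t for d in docs for t in set(d))
    # Rank 0 goes to the most frequent term; ties are broken alphabetically.
    rank = {t: r for r, t in enumerate(sorted(df, key=lambda t: (-df[t], t)))}
    # Step 1: sort the terms inside each document by their rank.
    keys = [tuple(sorted(rank[t] for t in set(d))) for d in docs]
    # Step 2: sort the documents by those keys (stable for equal keys).
    return sorted(range(len(docs)), key=lambda i: keys[i])

def gamma_bits(docs, order):
    """Total bits to store all d-gaps in Elias gamma code under `order`."""
    new_id = {old: new for new, old in enumerate(order)}
    postings = {}
    for old, d in enumerate(docs):
        for t in set(d):
            postings.setdefault(t, []).append(new_id[old])
    bits = 0
    for ids in postings.values():
        prev = -1  # so that the first gap is (first id + 1) >= 1
        for i in sorted(ids):
            gap = i - prev
            bits += 2 * (gap.bit_length() - 1) + 1  # gamma code length
            prev = i
    return bits

# Toy collection: each document is a set of terms.
docs = [
    {"a", "b"},
    {"c", "d"},
    {"a", "b", "e"},
    {"c", "d"},
    {"a", "e"},
    {"c"},
]
order = reassign_doc_ids(docs)
before = gamma_bits(docs, list(range(len(docs))))
after = gamma_bits(docs, order)
print(order, before, after)
```

On this toy input the reordering places documents that share frequent terms next to each other, which shrinks the d-gaps and therefore the gamma-coded size. Sorting n keys whose average length is \(\overline{|D|}\) costs on the order of \(\overline{|D|} \cdot n\log n\) elementary comparisons, which is consistent with the complexity quoted in the abstract.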

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Shi, L., Wang, B. (2012). Yet Another Sorting-Based Solution to the Reassignment of Document Identifiers. In: Hou, Y., Nie, JY., Sun, L., Wang, B., Zhang, P. (eds) Information Retrieval Technology. AIRS 2012. Lecture Notes in Computer Science, vol 7675. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35341-3_20

  • DOI: https://doi.org/10.1007/978-3-642-35341-3_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35340-6

  • Online ISBN: 978-3-642-35341-3

  • eBook Packages: Computer Science, Computer Science (R0)
