Similarity Sketching

  • Living reference work entry
Encyclopedia of Big Data Technologies

Synonyms

Distance estimation; Similarity estimation; Similarity summarization

Overview

Similarity between a pair of objects, usually expressed as a similarity score in [0, 1], is a key concept when dealing with noisy or uncertain data, as is common in big data applications.

The aim of similarity sketching is to estimate similarities in a (high-dimensional) space using fewer computational resources (time and/or storage) than a naïve approach that stores unprocessed objects. This is achieved using a form of lossy compression that produces succinct representations of objects in the space, from which similarities can be estimated. In some spaces, it is more natural to consider distances rather than similarities; we will consider both of these measures of proximity in the following.
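As one illustration of such a lossy compression, the following is a minimal sketch of minwise hashing (MinHash), a widely used technique for estimating the Jaccard similarity of sets. It is not taken from this entry's own definitions: the function names, the choice of BLAKE2b as the hash function, and the parameter k (number of sketch components) are assumptions made for the example. Each set is compressed to k integers, and the fraction of positions on which two sketches agree serves as an estimate of the similarity.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Hypothetical helper: a seeded 64-bit hash built from BLAKE2b."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(items, k=128):
    """Compress a non-empty set of strings into k integers.

    Component i is the minimum value of the i-th seeded hash over the set.
    """
    items = set(items)
    if not items:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(seed, x) for x in items) for seed in range(k)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of agreeing components."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


def exact_jaccard(s, t):
    """Naive baseline that needs the unprocessed sets."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)


if __name__ == "__main__":
    a = {f"w{i}" for i in range(0, 150)}
    b = {f"w{i}" for i in range(50, 200)}
    sa, sb = minhash_sketch(a, k=256), minhash_sketch(b, k=256)
    print("exact:   ", exact_jaccard(a, b))
    print("estimate:", estimate_jaccard(sa, sb))
```

The sketch length k trades storage for accuracy: under the usual idealized assumption that the hash functions behave randomly, the standard deviation of the estimate shrinks roughly as 1/sqrt(k), independently of the size of the original sets.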

Definitions

Formally, consider a space X of objects and a function d : X × X → R₊. We refer to d as a distance function for X. Similarity sketching with respect to (X, d) is done by using a sketching function c:...

Acknowledgements

This work received support from the European Research Council under the European Union’s 7th Framework Programme (FP7/2007-2013)/ ERC grant agreement no. 614331.

Author information

Correspondence to Rasmus Pagh.

Copyright information

© 2018 Springer International Publishing AG

About this entry

Cite this entry

Pagh, R. (2018). Similarity Sketching. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_58-1

  • DOI: https://doi.org/10.1007/978-3-319-63962-8_58-1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63962-8

  • Online ISBN: 978-3-319-63962-8

  • eBook Packages: Springer Reference Mathematics; Reference Module Computer Science and Engineering
