Theory and applications of b-bit minwise hashing

Ping Li, Arnd Christian König

Research output: Contribution to journalArticle

55 Scopus citations

Abstract

Efficient (approximate) computation of set similarity in very large datasets is a common task with many applications in information retrieval and data management. One common approach for this task is minwise hashing. This paper describes b-bit minwise hashing, which can provide an order of magnitude improvements in storage requirements and computational overhead over the original scheme in practice. We give both theoretical characterizations of the performance of the new algorithm as well as a practical evaluation on large real-life datasets and show that these match very closely. Moreover, we provide a detailed comparison with other important alternative techniques proposed for estimating set similarities. Our technique yields a very simple algorithm and can be realized with only minor modifications to the original minwise hashing scheme.

Original languageEnglish (US)
Article number1978566
Pages (from-to)101-109
Number of pages9
JournalCommunications of the ACM
Volume54
Issue number8
DOIs
StatePublished - Aug 1 2011
Externally publishedYes

All Science Journal Classification (ASJC) codes

  • Computer Science(all)

Fingerprint Dive into the research topics of 'Theory and applications of b-bit minwise hashing'. Together they form a unique fingerprint.

  • Cite this