String similarity search and join: a survey

Minghe Yu, Guoliang Li, Dong Deng, Jianhua Feng

Research output: Contribution to journalReview articlepeer-review

61 Scopus citations

Abstract

String similarity search and join are two important operations in data cleaning and integration, which extend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-world applications, such as spell checking, duplicate detection, entity resolution, and webpage clustering. Although these two problems have been extensively studied in the recent decade, there is no thorough survey. In this paper, we present a comprehensive survey on string similarity search and join. We first give the problem definitions and introduce widely-used similarity functions to quantify the similarity. We then present an extensive set of algorithms for string similarity search and join. We also discuss their variants, including approximate entity extraction, type-ahead search, and approximate substring matching. Finally, we provide some open datasets and summarize some research challenges and open problems.

Original languageEnglish (US)
Pages (from-to)399-417
Number of pages19
JournalFrontiers of Computer Science
Volume10
Issue number3
DOIs
StatePublished - Jun 1 2016
Externally publishedYes

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science(all)

Keywords

  • similarity join
  • similarity search
  • string similarity
  • top-k

Fingerprint

Dive into the research topics of 'String similarity search and join: a survey'. Together they form a unique fingerprint.

Cite this