Top-k string similarity search with edit-distance constraints

Dong Deng, Guoliang Li, Jianhua Feng, Wen Syan Li

Research output: Chapter in Book/Report/Conference proceedingConference contribution

56 Scopus citations

Abstract

String similarity search is a fundamental operation in many areas, such as data cleaning, information retrieval, and bioinformatics. In this paper we study the problem of top-k string similarity search with edit-distance constraints, which, given a collection of strings and a query string, returns the top-k strings with the smallest edit distances to the query string. Existing methods usually try different edit-distance thresholds and select an appropriate threshold to find top-k answers. However it is rather expensive to select an appropriate threshold. To address this problem, we propose a progressive framework by improving the traditional dynamic-programming algorithm to compute edit distance. We prune unnecessary entries in the dynamic-programming matrix and only compute those pivotal entries. We extend our techniques to support top-k similarity search. We develop a range-based method by grouping the pivotal entries to avoid duplicated computations. Experimental results show that our method achieves high performance, and significantly outperforms state-of-the-art approaches on real-world datasets.

Original languageEnglish (US)
Title of host publicationICDE 2013 - 29th International Conference on Data Engineering
Pages925-936
Number of pages12
DOIs
StatePublished - 2013
Externally publishedYes
Event29th International Conference on Data Engineering, ICDE 2013 - Brisbane, QLD, Australia
Duration: Apr 8 2013Apr 11 2013

Publication series

NameProceedings - International Conference on Data Engineering

Conference

Conference29th International Conference on Data Engineering, ICDE 2013
Country/TerritoryAustralia
CityBrisbane, QLD
Period4/8/134/11/13

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Signal Processing

Fingerprint

Dive into the research topics of 'Top-k string similarity search with edit-distance constraints'. Together they form a unique fingerprint.

Cite this