Detecting duplicate biological entities using shortest path edit distance

Alex Rudniy, Min Song, James Geller

Research output: Contribution to journal › Article › peer-review

12 Scopus citations

Abstract

Duplicate entity detection in biological data is an important research task. In this paper, we propose a novel, context-sensitive Shortest Path Edit Distance (SPED) that extends and supplements our previous work on Markov Random Field-based Edit Distance (MRFED). SPED recasts the computation of edit distance as the calculation of the shortest path between two selected vertices of a graph. We produce several variants of SPED by applying Levenshtein, arithmetic mean, histogram difference and TF-IDF techniques to solve subtasks. We compare the performance of SPED with that of other well-known distance algorithms for biological entity matching. The experimental results show that SPED produces competitive outcomes.
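
To illustrate the general idea of treating edit distance as a shortest-path problem, the sketch below computes the plain Levenshtein distance with Dijkstra's algorithm over the standard edit graph. It is not the authors' SPED: the abstract does not specify SPED's graph construction or its context-sensitive weights, so the unit edge costs and the function name `edit_distance_shortest_path` are assumptions made purely for illustration.

```python
import heapq


def edit_distance_shortest_path(a: str, b: str) -> int:
    """Levenshtein distance computed as a shortest path in the edit graph.

    Vertex (i, j) means "the first i characters of `a` have been turned
    into the first j characters of `b`". Unit edge costs are an assumption
    for illustration; they are NOT the weights used by SPED.
    """
    source, target = (0, 0), (len(a), len(b))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, (i, j) = heapq.heappop(heap)
        if (i, j) == target:
            return d
        if d > dist[(i, j)]:
            continue  # stale heap entry, a shorter path was already found
        edges = []
        if i < len(a):
            edges.append(((i + 1, j), 1))  # delete a[i]
        if j < len(b):
            edges.append(((i, j + 1), 1))  # insert b[j]
        if i < len(a) and j < len(b):
            # match costs 0, substitution costs 1
            edges.append(((i + 1, j + 1), 0 if a[i] == b[j] else 1))
        for vertex, weight in edges:
            candidate = d + weight
            if candidate < dist.get(vertex, float("inf")):
                dist[vertex] = candidate
                heapq.heappush(heap, (candidate, vertex))
    return dist[target]


# Two spellings of the same gene symbol differ by one inserted character.
print(edit_distance_shortest_path("BRCA1", "BRCA-1"))
```

Replacing the unit costs with weights that depend on the surrounding characters would be one way to make such a graph context-sensitive, though how SPED actually does this is described only in the full paper.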

Original language: English (US)
Pages (from-to): 395-410
Number of pages: 16
Journal: International Journal of Data Mining and Bioinformatics
Volume: 4
Issue number: 4
State: Published - Jul 2010

All Science Journal Classification (ASJC) codes

  • Information Systems
  • Library and Information Sciences
  • Biochemistry, Genetics and Molecular Biology (all)

Keywords

  • Biological entity matching
  • Duplicate record detection
  • Histogram matching
  • Levenshtein
  • SPED
  • Shortest path edit distance
  • Text mining
