Allign: Aligning All-Pair Near-Duplicate Passages in Long Texts

Weiqi Feng, Dong Deng

Research output: Contribution to journalConference articlepeer-review

Abstract

In this paper, we study the problem of aligning all-pair near-duplicate passages in two long texts. A passage is a sequence of consecutive words in a text. It can begin and end with any word in the text, whether around a period or not. Due to the high computation cost of this problem, existing work all compromise to heuristic alignment methods, which can harm the recall of downstream applications, such as deduplication and plagiarism detection. To address this problem, in this paper, we propose a min-hash based method Allign to find all near-duplicate passage pairs in two long texts. Allign generates a few min-hash values for each passage in the texts and reports all the passage pairs sharing enough common min-hash values. However, for a pair of texts with n and m words, there are in total O(n2m2) passage pairs (each text contains O(n2) and O(m2) passages respectively). Thus it is prohibitively expensive to enumerate all passage pairs in two texts and count their common min-hash values. To address this issue, Allign packs a large number of nearby and overlapping passages with the same min-hash to a "compact window". In total, Allign generates O(n) compact windows to represent all the O(n2) passages in a text with n words. Next, a pair of compact windows in two texts are matched if they have the same min-hash. The rest of unmatched compact windows are removed. Finally, Allign reports all the passage pairs contained by enough number of matched compact window pairs, which must share the same enough number of min-hash values. In this way, Allign avoids enumerating the enormous number of passage pairs. Last but not least, to make the reported near-duplicate passages more relevant and Allign more efficient, we show how to support a few practical constraints efficiently, including reporting only longest near-duplicates and sentence-level near-duplicates. Experimental results on real-world datasets show that Allign significantly outperforms the state-of-the-art text alignment methods.

Original languageEnglish (US)
Pages (from-to)541-553
Number of pages13
JournalProceedings of the ACM SIGMOD International Conference on Management of Data
DOIs
StatePublished - 2021
Externally publishedYes
Event2021 International Conference on Management of Data, SIGMOD 2021 - Virtual, Online, China
Duration: Jun 20 2021Jun 25 2021

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems

Keywords

  • deduplication
  • near-duplicate detection
  • plagiarism detection
  • text alignment
  • text curation

Fingerprint

Dive into the research topics of 'Allign: Aligning All-Pair Near-Duplicate Passages in Long Texts'. Together they form a unique fingerprint.

Cite this