Shingle method

This article discusses methods for identifying duplicate documents in order to prevent their inclusion in collections. It also analyses approaches to finding near-duplicate documents based on the shingle method, with applications to spam e-mail detection, plagiarism search, and the removal of duplicates from document collections. The paper further analyses the methods and parameter selection of the shingle algorithm and the criteria for selecting checksums (signatures). A program for identifying duplicates was developed, and criteria are proposed for selecting an optimized variant of the shingle algorithm based on the minhash and supershingle algorithms.


Shingles, supershingles, fuzzy duplicates, text similarity, shingle algorithm
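
For orientation, the basic idea behind the shingle method can be sketched as follows. This is a minimal illustration and not the program described in this paper. The function names `shingles` and `resemblance`, the shingle length `w = 3`, the word-level tokenization and the choice of CRC32 as the checksum are all assumptions made for the example. Each document is reduced to the set of checksums of its overlapping w-word sequences, and two documents are compared by the ratio of shared checksums to all checksums (the Jaccard coefficient).

```python
import re
import zlib


def shingles(text, w=3):
    """Return the set of CRC32 checksums of all w-word shingles in text.

    Assumptions for this sketch: words are lowercased alphanumeric runs,
    and a text shorter than w words yields a single shingle.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return set()
    if len(words) < w:
        return {zlib.crc32(" ".join(words).encode("utf-8"))}
    return {
        zlib.crc32(" ".join(words[i:i + w]).encode("utf-8"))
        for i in range(len(words) - w + 1)
    }


def resemblance(a, b, w=3):
    """Jaccard coefficient of the shingle checksum sets of two texts."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox leaps over the lazy dog"
    print(resemblance(doc1, doc2))
```

A resemblance close to 1 suggests a near-duplicate, while the threshold at which a pair is reported is a parameter to be tuned. Minhash and supershingles reduce the cost of this comparison by working with a small sample or digest of the checksum set instead of the full set.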

