In general, CopySpider's operation can be described as the following process:
- Web search.
- Web links filter.
- Download of candidates.
- Collusion analysis.
Each of these steps is described in more detail below, along with the internal techniques CopySpider uses.
Upon receiving an input document to be tested, CopySpider parses the entire document and builds a data structure with the relevant information that will be used to search for similar documents. This analysis technique is called fingerprinting. The more representative a document's fingerprint, the faster and more accurate the search for similar documents, enabling the discovery of unauthorized copies.
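CopySpider does not publish its fingerprint format, but the idea can be illustrated with a common approach: hashing overlapping word k-grams ("shingles") into a compact set. The function below is a minimal sketch under that assumption, not CopySpider's actual implementation.

```python
import hashlib

def fingerprint(text, k=5):
    """Build an illustrative fingerprint: the set of short hashes of all
    k-word shingles in the text. (Hypothetical; CopySpider's real
    fingerprint format is not public.)"""
    words = text.lower().split()
    shingles = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return {hashlib.md5(s.encode("utf-8")).hexdigest()[:8] for s in shingles}

doc = "the quick brown fox jumps over the lazy dog"
print(sorted(fingerprint(doc)))  # a compact set of shingle hashes
```

Because the fingerprint is a set of hashes rather than raw text, documents can be compared and queried far more cheaply than by full-text comparison.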
- Web search
With the fingerprint of the input document, CopySpider performs a series of queries on the internet. Each query is analyzed, and the resulting links are stored in a second data structure that feeds the next step, the web links filter.
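One plausible way to derive queries from a document is to extract a few distinctive phrases and quote them for exact-match search. The sketch below is purely illustrative; CopySpider's real query-generation strategy and search backend are not public.

```python
def build_queries(text, phrase_len=8, max_queries=3):
    """Turn a document into a few quoted search queries by sampling
    phrases spread across the text. (Hypothetical strategy; not
    CopySpider's actual algorithm.)"""
    words = text.split()
    step = max(1, len(words) // max_queries)
    queries = []
    for i in range(0, len(words) - phrase_len + 1, step):
        queries.append('"' + " ".join(words[i:i + phrase_len]) + '"')
        if len(queries) == max_queries:
            break
    return queries
```

Quoted phrase queries favor pages that reuse the document's wording verbatim, which is exactly the kind of result a copy detector wants to surface.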
- Web links filter
With the set of web links from the web search step, CopySpider applies a filter that selects the most relevant results, identifying the documents most likely to be similar to the input document. These documents are called candidates.
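Candidate selection can be sketched as a simple ranking step: score each result by its textual overlap with the input document and keep the top links. The scoring used here (word overlap with result snippets) is an assumption for illustration only.

```python
def select_candidates(doc_text, results, top_n=2):
    """Rank (url, snippet) search results by word overlap with the input
    document and keep the most promising links as candidates.
    (Illustrative scoring; not CopySpider's actual filter.)"""
    doc_words = set(doc_text.lower().split())
    scored = []
    for url, snippet in results:
        overlap = len(doc_words & set(snippet.lower().split()))
        scored.append((overlap, url))
    scored.sort(reverse=True)
    return [url for _, url in scored[:top_n]]

results = [
    ("http://a.example", "quick brown fox jumps"),
    ("http://b.example", "totally unrelated snippet"),
    ("http://c.example", "the lazy dog sleeps"),
]
print(select_candidates("the quick brown fox jumps over the lazy dog", results))
# → ['http://a.example', 'http://c.example']
```

Filtering before downloading matters: fetching every search result would dominate the total running time, so only the highest-scoring links proceed to the download step.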
- Download of candidates
For each candidate document, CopySpider attempts to download its content to a local file.
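Because candidate URLs can be slow or dead, a downloader needs a timeout and graceful failure handling. The helper below is a minimal sketch using Python's standard library; CopySpider's actual downloader and its handling of non-HTML formats are not public.

```python
import urllib.error
import urllib.request

def download_candidate(url, timeout=10):
    """Fetch a candidate document, returning its text, or None when the
    URL is unreachable. (Illustrative; not CopySpider's real downloader.)"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, TimeoutError):
        return None
```

Returning `None` instead of raising lets the pipeline simply skip unreachable candidates and continue with the rest.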
- Collusion analysis
With the candidates downloaded, CopySpider runs a second comparison process, applying a technique called collusion analysis. This second comparison step is very fast because all of the candidates' information is held in memory. It marks an important difference in CopySpider's technique, because it increases the accuracy of the similarity results. This second comparison by collusion analysis is detailed, fast, held in memory, and does not compromise the total computational time.
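The in-memory comparison can be illustrated with a standard similarity measure: Jaccard similarity over word n-grams between the input document and each candidate. This is a stand-in for CopySpider's undisclosed collusion-analysis metric, not a description of it.

```python
def shingles(text, k=3):
    """Set of overlapping k-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(0, len(words) - k + 1))}

def collusion_scores(input_text, candidates, k=3):
    """Compare the input document against each downloaded candidate in
    memory, scoring each pair with Jaccard similarity over word k-grams.
    (Illustrative metric; not CopySpider's actual algorithm.)"""
    src = shingles(input_text, k)
    scores = {}
    for name, text in candidates.items():
        cand = shingles(text, k)
        union = src | cand
        scores[name] = len(src & cand) / len(union) if union else 0.0
    return scores
```

Since every candidate is already in memory, scoring all pairs is just set arithmetic, which is why this second pass adds little to the total running time.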
Finally, CopySpider produces reports that color-highlight the matches between the input document and the candidates' content. Similar chunks of text are highlighted to support the user's analysis and decision. The reports make it easier to classify the input document as containing, or not containing, literal copies or excerpts with incorrect references.
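A highlighted report can be sketched as HTML in which word k-grams that also occur in a candidate are wrapped in `<mark>` tags. This is a minimal illustration of the idea; CopySpider's real report format and matching rules are not public.

```python
import html

def highlight_report(input_text, candidate_text, k=3):
    """Render the input document as HTML, marking word k-grams that also
    appear in a candidate. (Minimal sketch; not CopySpider's report.)"""
    words = input_text.split()
    cwords = candidate_text.lower().split()
    cand = {" ".join(cwords[i:i + k])
            for i in range(max(0, len(cwords) - k + 1))}
    flagged = [False] * len(words)
    for i in range(max(0, len(words) - k + 1)):
        if " ".join(w.lower() for w in words[i:i + k]) in cand:
            for j in range(i, i + k):
                flagged[j] = True
    out = []
    for w, f in zip(words, flagged):
        esc = html.escape(w)
        out.append(f"<mark>{esc}</mark>" if f else esc)
    return " ".join(out)
```

Highlighting at the chunk level (rather than single words) keeps the report focused on contiguous reused passages, which is what a reviewer needs to judge literal copying.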