Design a System to Detect Plagiarism

Question

Accepted Answer

To design a plagiarism detection system for millions of documents, we first clarify that comparing every pair of documents with exact string matching is infeasible at this scale, and that exact matching would also miss lightly edited copies. We need an approximate approach that identifies near-duplicates efficiently.

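First, during ingestion, we apply shingling to each document. We slide a window of k consecutive tokens (words or characters) over the text and collect the overlapping k-grams, called shingles, into a set. A small edit changes only the few shingles that overlap it, so lightly edited copies still share most of their shingles; because the set ignores order, moving whole passages around changes little as well. Next, we compute a MinHash signature for each set. For each of n hash functions we record the minimum hash value over the document's shingles, which yields a compact, fixed-length signature vector. The fraction of positions at which two signatures agree is an estimate of the Jaccard similarity between the two shingle sets. Because signatures are small, we can keep them for millions of documents in memory or on fast SSDs.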
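The ingestion step can be illustrated with a minimal sketch. It assumes word-level shingles, 128 hash functions of the form (a*h + b) mod p, and a BLAKE2b base hash. The function names and all of these parameter choices are hypothetical, not part of the original answer.

```python
import hashlib
import random

_PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus (an assumed choice)


def shingles(text: str, k: int = 5) -> set[str]:
    """Return the set of overlapping k-word shingles in the text."""
    words = text.lower().split()
    if len(words) < k:
        # Short documents become a single shingle (or an empty set).
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _stable_hash(shingle: str) -> int:
    """64-bit hash that is stable across processes, unlike built-in hash()."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def make_hash_params(num_hashes: int = 128, seed: int = 42) -> list[tuple[int, int]]:
    """Draw (a, b) coefficients for num_hashes hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set: set[str],
                      params: list[tuple[int, int]]) -> list[int]:
    """For each hash function, keep the minimum value over all shingles."""
    if not shingle_set:
        raise ValueError("cannot compute a signature for an empty shingle set")
    base = [_stable_hash(s) for s in shingle_set]
    return [min((a * h + b) % _PRIME for h in base) for a, b in params]


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature positions that agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Every document must be hashed with the same `params`, otherwise signatures are not comparable. More hash functions tighten the similarity estimate at the cost of a larger signature.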

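For indexing, we cannot afford to compare every pair of signatures. Instead, we use Locality-Sensitive Hashing (LSH). We divide each MinHash signature into b bands of r rows and hash each band into a bucket. Two documents that land in the same bucket for at least one band become a candidate pair. As long as buckets stay small, this reduces the work from quadratic in the number of documents to roughly linear in practice. The values of b and r are tuned to trade missed matches against extra candidates.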
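The banding step can be sketched as follows, under the assumption that signatures are plain lists of integers of length bands * rows and that each band's tuple of values serves directly as the bucket key. The function names are hypothetical. The second function gives the standard probability that two documents with Jaccard similarity s share at least one band, 1 - (1 - s^r)^b.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures: dict[str, list[int]],
                        bands: int, rows: int) -> set[tuple[str, str]]:
    """Return pairs of document ids that share a bucket in at least one band."""
    buckets: dict[tuple, list[str]] = defaultdict(list)
    for doc_id, sig in signatures.items():
        if len(sig) != bands * rows:
            raise ValueError("signature length must equal bands * rows")
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            # The band index is part of the key, so identical values in
            # different bands never collide with each other.
            buckets[(band, chunk)].append(doc_id)

    candidates: set[tuple[str, str]] = set()
    for doc_ids in buckets.values():
        for a, b in combinations(sorted(doc_ids), 2):
            candidates.add((a, b))
    return candidates


def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that a pair with Jaccard similarity s becomes a candidate."""
    return 1 - (1 - s ** rows) ** bands
```

More rows per band make each band harder to match and so cut false positives; more bands give similar pairs more chances to collide and so cut false negatives.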

Finally, we implement a verification step. Candidate pairs undergo a precise check, such as an edit-distance calculation on the text or a cosine or Jaccard similarity computed on the original shingles, to filter out false positives. To support Google's scale, this pipeline runs on a distributed framework such as MapReduce or Apache Flink, which lets it scale horizontally. We must also consider incremental updates, so that new documents can be fingerprinted and added to the index without re-indexing the entire corpus.
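Verification and incremental updates can be combined in one small in-memory sketch. It assumes exact Jaccard similarity as the precise check (edit distance or cosine similarity would slot in the same way), a single-process index, and a hypothetical class name and 0.8 threshold. A production system would shard the buckets across machines.

```python
from collections import defaultdict


def exact_jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class IncrementalLSHIndex:
    """Band-bucket index that accepts documents one at a time."""

    def __init__(self, bands: int, rows: int) -> None:
        self.bands = bands
        self.rows = rows
        self.buckets: dict[tuple, set[str]] = defaultdict(set)
        self.shingle_sets: dict[str, set[str]] = {}

    def _band_keys(self, signature: list[int]) -> list[tuple]:
        if len(signature) != self.bands * self.rows:
            raise ValueError("signature length must equal bands * rows")
        return [(i, tuple(signature[i * self.rows:(i + 1) * self.rows]))
                for i in range(self.bands)]

    def add(self, doc_id: str, signature: list[int], shingle_set: set[str],
            threshold: float = 0.8) -> list[tuple[str, float]]:
        """Index a new document and return verified matches, best first."""
        candidates: set[str] = set()
        for key in self._band_keys(signature):
            candidates |= self.buckets[key]   # look up existing neighbours
            self.buckets[key].add(doc_id)     # then register the new document
        candidates.discard(doc_id)
        self.shingle_sets[doc_id] = shingle_set

        # Verification: only candidates pay for the exact comparison.
        matches = []
        for other in candidates:
            score = exact_jaccard(shingle_set, self.shingle_sets[other])
            if score >= threshold:
                matches.append((other, score))
        return sorted(matches, key=lambda m: (-m[1], m[0]))
```

Each call to `add` touches only the buckets the new signature hashes into, so existing documents are never re-fingerprinted or re-indexed.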
