Design a System to Detect Plagiarism
Design a service to detect textual similarity across millions of documents. Focus on document fingerprinting (shingling) and indexing (MinHash/LSH).
Why Interviewers Ask This
Interviewers at Google ask this to evaluate your ability to handle massive-scale data problems where exact matching is computationally impossible. They specifically test your knowledge of probabilistic algorithms like MinHash and LSH, your capacity to balance precision with recall, and your skill in designing distributed systems that can process millions of documents efficiently within strict latency constraints.
How to Answer This Question
1. Clarify Requirements: Immediately define scale (document count, size), latency expectations, and the definition of 'similarity' (exact vs. near-duplicate). Ask if real-time or batch processing is required.
2. High-Level Architecture: Propose a pipeline separating ingestion, fingerprinting, indexing, and querying. Emphasize distributed storage like Bigtable or Spanner for scalability.
3. Core Algorithm Design: Detail the shingling process (k-grams) to break text into chunks. Explain how MinHash converts these sets into compact signatures while preserving Jaccard similarity.
4. Scaling via LSH: Describe using Locality Sensitive Hashing to group similar signatures into buckets, avoiding O(N^2) comparisons. Discuss partitioning strategies across clusters.
5. Refinement & Trade-offs: Address false positives/negatives, handling updates, and storage optimization. Conclude by discussing how this aligns with Google's focus on efficient, scalable infrastructure.
Key Points to Cover
- Explicitly mention MinHash and LSH as the core techniques for scaling beyond brute force
- Demonstrate understanding of Shingling (k-grams) as the preprocessing step for text normalization
- Discuss the trade-off between computational efficiency and detection accuracy (false positives)
- Propose a distributed architecture capable of handling petabytes of data and high throughput
- Address the requirement for incremental updates rather than full reprocessing for new documents
Sample Answer
To design a plagiarism detection system for millions of documents, we first clarify that exact string matching is infeasible due to scale. We need an approximate solution that identifies near-duplicates efficiently.
First, during ingestion, we apply shingling to each document. We slide a window of size k over the text to generate overlapping n-grams, converting the document into a set of unique tokens. This handles minor edits and reordering. Next, we compute MinHash signatures for these sets. By applying multiple hash functions, we create a compact signature vector that preserves the Jaccard similarity between any two documents. A small signature allows us to store millions of documents in memory or fast SSDs.
For indexing, we cannot compare every pair of signatures. Instead, we use Locality Sensitive Hashing (LSH). We divide the MinHash signature into bands and hash each band. Documents falling into the same bucket are candidates for comparison. This reduces the complexity from quadratic to nearly linear.
Finally, we implement a verification step. Candidate pairs undergo a precise edit-distance calculation or cosine similarity check on the original shingles to filter false positives. To support Google's scale, this pipeline runs on a distributed framework like MapReduce or Flink, ensuring horizontal scalability. We must also consider incremental updates, allowing new documents to be fingerprinted without re-indexing the entire corpus.
Common Mistakes to Avoid
- Focusing solely on exact string matching algorithms like Rabin-Karp without addressing the impossibility of O(N^2) comparisons at scale
- Ignoring the preprocessing step of shingling, which is critical for detecting paraphrased or slightly modified content
- Overlooking the need for LSH to reduce the search space, leading to a proposal that would crash under real-world load
- Failing to discuss how to handle updates or new document ingestion, resulting in a static system design
Practice This Question with AI
Answer this question orally or via text and get instant AI-powered feedback on your response quality, structure, and delivery.
Related Interview Questions
Design a Payment Processing System
Hard
UberDesign a System for Real-Time Fleet Management
Hard
UberDesign a CDN Edge Caching Strategy
Medium
AmazonDesign a System for Monitoring Service Health
Medium
SalesforceDefining Your Own Success Metrics
Medium
GoogleProduct Strategy: Addressing Market Saturation
Medium
Google