Design a News Aggregator (Google News)

System Design
Medium
Google

Design a service that aggregates articles from various sources. Focus on scraping/crawling, article clustering, and deduplication.

Why Interviewers Ask This

Interviewers ask this to evaluate your ability to architect scalable distributed systems handling massive unstructured data. They specifically assess your understanding of web crawling constraints, real-time deduplication strategies using hashing or embeddings, and clustering algorithms for grouping similar stories. At Google, they also look for your capacity to balance trade-offs between consistency, latency, and cost in a high-throughput environment.

How to Answer This Question

1. Clarify requirements by defining scale (articles per second), freshness (real-time vs. batch), and core features like deduplication and clustering.
2. Estimate system capacity using back-of-the-envelope calculations for storage and bandwidth.
3. Design the high-level architecture, starting with a crawler layer that respects robots.txt and rate limits.
4. Detail the ingestion pipeline, focusing on text normalization and similarity detection using MinHash or LSH for efficient deduplication.
5. Explain the clustering logic, perhaps using hierarchical agglomerative clustering or vector embeddings to group related articles.
6. Discuss infrastructure choices like Bigtable for storage and Spanner for consistency, aligning with Google's internal tooling preferences while explaining why.
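The capacity estimate in step 2 can be sketched with hypothetical inputs. All of the numbers below (source count, articles per source, article size) are illustrative assumptions, not figures from the question:

```python
# Hypothetical back-of-the-envelope inputs (assumptions, not given numbers):
sources = 1_000_000            # news sources crawled
articles_per_source_day = 10   # average new articles per source per day
avg_article_kb = 50            # normalized text + metadata per article

articles_per_day = sources * articles_per_source_day          # 10M articles/day
articles_per_sec = articles_per_day / 86_400                  # ~116 articles/s
storage_per_day_gb = articles_per_day * avg_article_kb / 1e6  # ~500 GB/day
storage_per_year_tb = storage_per_day_gb * 365 / 1000         # ~180 TB/year
```

Stating the arithmetic out loud like this lets the interviewer correct your assumptions early, before they shape the rest of the design.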

Key Points to Cover

  • Explicitly addressing the challenge of detecting near-duplicate content beyond simple string matching
  • Proposing specific algorithms like MinHash or LSH for efficient large-scale deduplication
  • Demonstrating knowledge of Google-scale infrastructure patterns such as sharding and async queues
  • Balancing the trade-off between real-time processing speed and computational cost
  • Including a concrete strategy for handling crawl politeness and source reliability
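As a concrete illustration of the MinHash point above, here is a minimal pure-Python sketch. The shingle size, number of hash functions, and helper names are illustrative choices, not a prescribed implementation:

```python
import hashlib

def shingles(text, k=3):
    """k-word shingles of lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(text, num_hashes=64):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over the article's shingle set."""
    shingle_set = shingles(text)
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The key property: the probability that two articles agree in one signature slot equals the Jaccard similarity of their shingle sets, so a 64-slot signature gives a cheap similarity estimate without comparing full texts.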

Sample Answer

To design a News Aggregator like Google News, I'd first clarify the scope: we need to handle millions of sources with sub-second latency for breaking news. The system starts with a distributed crawler, likely built in Go, which prioritizes sites by update frequency and uses politeness policies to avoid being blocked.

Once articles are fetched, they enter an ingestion pipeline where we normalize the text, removing boilerplate and standardizing encoding. For deduplication, exact matches are caught via SHA-256 hashes, but near-duplicates require a more sophisticated approach: I'd use MinHash with Locality-Sensitive Hashing (LSH) to efficiently flag articles whose bodies overlap heavily even when their headlines differ.

After filtering duplicates, the remaining articles are clustered. Using TF-IDF vectors or pre-trained BERT embeddings, we can group articles by story. We might use K-Means for broad categories and DBSCAN for outlier detection to surface unique angles.

Finally, these clusters are indexed in a search engine like Solr or Elasticsearch, allowing users to query by topic. To ensure scalability, we'd shard the crawler and use asynchronous queues like Pub/Sub to decouple ingestion from processing, keeping the system robust even during major global events.
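The LSH step mentioned in the answer can be sketched with the standard banding trick over MinHash signatures. The band/row split and function name are illustrative assumptions:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands=16, rows=4):
    """Banded LSH over MinHash signatures: split each signature into
    `bands` bands of `rows` values each; any two articles that collide
    in at least one band become a candidate duplicate pair, to be
    verified with an exact comparison afterwards."""
    buckets = defaultdict(set)
    for article_id, sig in signatures.items():
        assert len(sig) == bands * rows
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(article_id)
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates
```

With 16 bands of 4 rows, a pair with Jaccard similarity s collides with probability 1 - (1 - s^4)^16, which rises steeply around s ≈ 0.5, so only likely near-duplicates are ever compared in full. This avoids the O(n^2) all-pairs comparison called out under common mistakes below.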

Common Mistakes to Avoid

  • Ignoring the 'robots.txt' protocol and legal implications of aggressive web scraping
  • Focusing solely on database schema without explaining how to process unstructured text
  • Suggesting O(n^2) comparison algorithms for deduplication which won't scale to billions of articles
  • Overlooking the need for a feedback loop to improve clustering accuracy over time
