Design a News Aggregator (Google News)

System Design
Medium
Google

Design a service that aggregates articles from various sources. Focus on scraping/crawling, article clustering, and deduplication.

Why Interviewers Ask This

Interviewers ask this to evaluate your ability to architect scalable distributed systems handling massive unstructured data. They specifically assess your understanding of web crawling constraints, real-time deduplication strategies using hashing or embeddings, and clustering algorithms for grouping similar stories. At Google, they also look for your capacity to balance trade-offs between consistency, latency, and cost in a high-throughput environment.

How to Answer This Question

1. Clarify requirements by defining scale (articles per second), freshness (real-time vs. batch), and core features like deduplication and clustering.
2. Estimate system capacity using back-of-the-envelope calculations for storage and bandwidth.
3. Design the high-level architecture, starting with a crawler layer that respects robots.txt and rate limits.
4. Detail the ingestion pipeline, focusing on text normalization and similarity detection using MinHash or LSH for efficient deduplication.
5. Explain the clustering logic, perhaps using hierarchical agglomerative clustering or vector embeddings to group related articles.
6. Discuss infrastructure choices like Bigtable for storage and Spanner for consistency, aligning with Google's internal tooling preferences while explaining why.
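The capacity estimate in step 2 can be sketched with hypothetical inputs. All of the numbers below (source count, articles per source, article size) are illustrative assumptions, not figures from the question:

```python
# Hypothetical back-of-the-envelope inputs (assumptions, not given numbers):
sources = 1_000_000            # news sources crawled
articles_per_source_day = 10   # average new articles per source per day
avg_article_kb = 50            # normalized text + metadata per article

articles_per_day = sources * articles_per_source_day          # 10M articles/day
articles_per_sec = articles_per_day / 86_400                  # ~116 articles/s
storage_per_day_gb = articles_per_day * avg_article_kb / 1e6  # ~500 GB/day
storage_per_year_tb = storage_per_day_gb * 365 / 1000         # ~180 TB/year
```

Stating the arithmetic out loud like this lets the interviewer correct your assumptions early, before they shape the rest of the design.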

Key Points to Cover

  • Explicitly addressing the challenge of detecting near-duplicate content beyond simple string matching
  • Proposing specific algorithms like MinHash or LSH for efficient large-scale deduplication
  • Demonstrating knowledge of Google-scale infrastructure patterns such as sharding and async queues
  • Balancing the trade-off between real-time processing speed and computational cost
  • Including a concrete strategy for handling crawl politeness and source reliability
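As a concrete illustration of the MinHash point above, here is a minimal pure-Python sketch. The shingle size, number of hash functions, and helper names are illustrative choices, not a prescribed implementation:

```python
import hashlib

def shingles(text, k=3):
    """k-word shingles of lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(text, num_hashes=64):
    """For each of num_hashes seeded hash functions, keep the minimum
    hash value over the article's shingle set."""
    shingle_set = shingles(text)
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

The key property: the probability that two articles agree in one signature slot equals the Jaccard similarity of their shingle sets, so a 64-slot signature gives a cheap similarity estimate without comparing full texts.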

Sample Answer

To design a News Aggregator like Google News, I'd first clarify the scope: we need to handle millions of sources with sub-second latency for breaking news. The system starts with a distributed crawler, likely built in Go, which prioritizes sites by update frequency and uses politeness policies to avoid being blocked.

Once articles are fetched, they enter an ingestion pipeline where we normalize the text, removing boilerplate and standardizing encoding. For deduplication, exact matches are caught via SHA-256 hashes, but near-duplicates require a more sophisticated approach: I'd use MinHash with Locality-Sensitive Hashing (LSH) to efficiently flag articles whose bodies overlap heavily even when their headlines differ.

After filtering duplicates, the remaining articles are clustered. Using TF-IDF vectors or pre-trained BERT embeddings, we can group articles by story. We might use K-Means for broad categories and DBSCAN for outlier detection to surface unique angles.

Finally, these clusters are indexed in a search engine like Solr or Elasticsearch, allowing users to query by topic. To ensure scalability, we'd shard the crawler and use asynchronous queues like Pub/Sub to decouple ingestion from processing, keeping the system robust even during major global events.
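The LSH step mentioned in the answer can be sketched with the standard banding trick over MinHash signatures. The band/row split and function name are illustrative assumptions:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands=16, rows=4):
    """Banded LSH over MinHash signatures: split each signature into
    `bands` bands of `rows` values each; any two articles that collide
    in at least one band become a candidate duplicate pair, to be
    verified with an exact comparison afterwards."""
    buckets = defaultdict(set)
    for article_id, sig in signatures.items():
        assert len(sig) == bands * rows
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(article_id)
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))
    return candidates
```

With 16 bands of 4 rows, a pair with Jaccard similarity s collides with probability 1 - (1 - s^4)^16, which rises steeply around s ≈ 0.5, so only likely near-duplicates are ever compared in full. This avoids the O(n^2) all-pairs comparison called out under common mistakes below.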

Common Mistakes to Avoid

  • Ignoring the 'robots.txt' protocol and legal implications of aggressive web scraping
  • Focusing solely on database schema without explaining how to process unstructured text
  • Suggesting O(n^2) comparison algorithms for deduplication which won't scale to billions of articles
  • Overlooking the need for a feedback loop to improve clustering accuracy over time
