Design a System for Monitoring E-commerce Price Changes

System Design
Medium
Amazon

Design a service that continuously scrapes competitor websites, detects price changes, and stores the history efficiently. Focus on politeness and change detection algorithms.

Why Interviewers Ask This

Interviewers at Amazon ask this to evaluate your ability to balance technical scalability with ethical constraints like politeness. They specifically test whether you can design a distributed scraping system that handles high concurrency, detects price deltas efficiently without redundant storage, and respects rate limits to avoid IP bans, all while maintaining data integrity for competitive analysis.

How to Answer This Question

1. Clarify requirements: Define scale (e.g., millions of SKUs), latency needs for real-time alerts, and the specific definition of 'politeness' regarding request rates.
2. Architecture overview: Propose a microservices architecture with a scheduler, scraper workers, a change detection engine, and a time-series database.
3. Politeness strategy: Detail how to implement randomized delays, user-agent rotation, and dynamic backoff algorithms to prevent blocking.
4. Change detection logic: Explain using content hashing or DOM diffing to identify actual price changes versus noise before writing to storage.
5. Data modeling: Describe storing only deltas in a wide-column store like Cassandra or a partitioned key-value store like DynamoDB to optimize read performance for historical trends.
6. Scalability: Discuss horizontal scaling of scraper nodes and partitioning strategies based on product categories.
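The politeness strategy in step 3 is often the part interviewers probe hardest. One way to make it concrete is a token bucket whose refill rate adapts to server feedback. This is a minimal sketch (class and parameter names are illustrative, not from any specific library):

```python
import time

class AdaptiveTokenBucket:
    """Token-bucket rate limiter per target site. The refill rate is
    halved when the server answers HTTP 429 (Too Many Requests) and
    recovers slowly on successes, so we back off under pressure."""

    def __init__(self, rate_per_sec=2.0, capacity=3, min_rate=0.1, max_rate=10.0):
        self.rate = rate_per_sec          # tokens added per second
        self.capacity = capacity          # burst ceiling
        self.min_rate = min_rate          # floor so crawling never fully stops
        self.max_rate = max_rate
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now

    def try_acquire(self):
        """Return True if a request may be sent now; otherwise the caller waits."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def record_response(self, status_code):
        """Halve the rate on 429; recover 10% per successful response."""
        if status_code == 429:
            self.rate = max(self.min_rate, self.rate / 2)
        elif 200 <= status_code < 300:
            self.rate = min(self.max_rate, self.rate * 1.1)
```

Each worker would keep one bucket per target domain, which keeps the backoff decision local and avoids coordinating rate state across the fleet.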

Key Points to Cover

  • Explicitly define a polite scraping strategy using dynamic backoff and rate limiting
  • Propose a hash-based change detection mechanism to minimize unnecessary storage writes
  • Select a time-series or partitioned NoSQL database optimized for historical price trends
  • Demonstrate understanding of distributed systems scaling through worker pools and task queues
  • Address the trade-off between data freshness and server load on competitor sites
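The hash-based change detection mentioned above can be sketched in a few lines: normalize the scraped price fragment first so cosmetic differences (whitespace, casing) don't produce spurious deltas, then compare digests. Function names here are illustrative:

```python
import hashlib

def price_fingerprint(price_text, currency):
    """Normalize the scraped price fragment and hash it, so whitespace
    or casing changes in the page don't trigger storage writes."""
    normalized = f"{currency.strip().upper()}:{price_text.strip()}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def detect_change(previous_hash, price_text, currency):
    """Return (changed, new_hash); the caller skips the write when unchanged."""
    new_hash = price_fingerprint(price_text, currency)
    return (new_hash != previous_hash, new_hash)
```

Hashing only the extracted price element, rather than the whole page, is what keeps unrelated DOM churn (banners, recommendations) from registering as price changes.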

Sample Answer

To design this system, I would start by defining the scope: monitoring thousands of competitors across millions of products with sub-minute latency. The core architecture would consist of a central Scheduler that assigns URL batches to a fleet of Scraper Workers.

To address the 'politeness' requirement, each worker implements an adaptive backoff algorithm. Instead of fixed intervals, we use a token bucket approach where the rate limit dynamically adjusts based on HTTP 429 responses from target sites, ensuring we never overwhelm their infrastructure.

For efficiency, we won't store every full page. Instead, the Change Detection Engine computes a hash of the relevant price elements. If the new hash matches the previous one, we skip storage entirely. Only when a delta is detected do we write a record containing the timestamp, old price, new price, and currency to a partitioned NoSQL store like Amazon DynamoDB, optimized for time-range queries. This minimizes storage costs and I/O.

Finally, we add a notification layer that triggers alerts via SNS when price drops exceed a configurable threshold, enabling rapid repricing decisions. This approach balances high throughput with strict adherence to web ethics.
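The delta-only storage model described in the answer can be illustrated with a small in-memory stand-in for the partitioned table: product ID as the partition key, timestamp as the sort key, and a time-range read over one partition (the shape a DynamoDB Query with a sort-key condition would serve). This is a hypothetical sketch, not DynamoDB client code:

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

class PriceDeltaStore:
    """In-memory model of a partitioned price-history table.
    Partition key: product_id. Sort key: timestamp.
    Only genuine deltas are ever written."""

    def __init__(self):
        # product_id -> list of (timestamp, old_price, new_price), kept sorted
        self._partitions = defaultdict(list)

    def write_delta(self, product_id, ts, old_price, new_price):
        """Write a record only when the price actually changed."""
        if old_price == new_price:
            return False                  # no delta: skip the write entirely
        self._partitions[product_id].append((ts, old_price, new_price))
        self._partitions[product_id].sort()
        return True

    def history(self, product_id, ts_start, ts_end):
        """Time-range query over a single partition (inclusive bounds)."""
        rows = self._partitions[product_id]
        keys = [r[0] for r in rows]
        return rows[bisect_left(keys, ts_start):bisect_right(keys, ts_end)]
```

Partitioning by product ID keeps each historical-trend query confined to one partition, which is what makes the time-range reads cheap at scale.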

Common Mistakes to Avoid

  • Ignoring the 'politeness' constraint and proposing aggressive scraping that would get IPs banned
  • Storing full HTML pages instead of just price deltas, leading to massive storage bloat
  • Failing to explain how to handle race conditions when multiple workers scrape the same item simultaneously
  • Overlooking error handling for network failures or CAPTCHA challenges during the scraping process
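On the last point, the standard remedy for transient network failures is retry with exponential backoff plus jitter, so a fleet of workers doesn't retry in lockstep. A minimal sketch, where `fetch` is a hypothetical callable supplied by the caller:

```python
import random
import time

def fetch_with_retry(fetch, url, max_attempts=4, base_delay=0.5):
    """Call fetch(url), retrying transient errors with exponential
    backoff plus random jitter. Re-raises after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise                     # out of attempts: surface the error
            delay = base_delay * (2 ** attempt)
            # jitter spreads retries out so workers don't synchronize
            time.sleep(delay + random.uniform(0, delay))
```

CAPTCHA challenges need different handling than this: they are not transient, so the URL should be routed to a separate queue (for a headless-browser pool or manual review) rather than retried.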
