Design a System for Data Sharding and Indexing

System Design
Hard
Amazon

Focus solely on the data layer. Design an automated system that handles creating new database shards and updating the global index transparently as data grows.

Why Interviewers Ask This

Interviewers at Amazon ask this to evaluate your ability to design scalable, automated data architectures that handle massive growth without human intervention. They specifically assess your understanding of consistency models, failure handling during shard rebalancing, and the trade-offs between indexing latency and throughput in a distributed environment.

How to Answer This Question

1. Clarify requirements by defining scale, read/write ratios, and acceptable latency; note that Amazon's customer-obsession focus implies high availability.
2. Propose a logical architecture that separates the metadata service, shard manager, and global index layer before drawing diagrams.
3. Detail the sharding strategy: explain how you determine key distribution and handle hotspots using consistent hashing or range-based partitioning.
4. Describe the automation workflow for adding shards, emphasizing idempotent operations and zero-downtime data migration.
5. Explain the global index update mechanism: discuss eventual versus strong consistency trade-offs and how you recover from index corruption.
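Step 3 can be sketched with a minimal consistent-hash ring using virtual nodes. This is an illustrative stand-in, not a production router; the class and method names are assumptions.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Stable hash for ring placement (md5 used only for illustration)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring: each physical node owns many virtual points,
    so adding a node remaps only a small fraction of keys."""

    def __init__(self, vnodes: int = 100):
        self.vnodes = vnodes
        self._hashes = []   # sorted virtual-node hashes
        self._nodes = []    # parallel list: owning physical node

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            h = _hash(f"{node}#{i}")
            idx = bisect.bisect(self._hashes, h)
            self._hashes.insert(idx, h)
            self._nodes.insert(idx, node)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._hashes)
        return self._nodes[idx]
```

Because keys only ever move *to* a newly added node, expansion touches roughly `1/N` of the data rather than reshuffling everything, which is what makes the automated shard-add workflow in step 4 tractable.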

Key Points to Cover

  • Explicitly mention eventual consistency for the global index to maintain high write throughput
  • Describe a specific strategy like Consistent Hashing to prevent data skew during expansion
  • Explain how the system handles partial failures during the shard splitting process
  • Detail the use of Write-Ahead Logs (WAL) to ensure no data is lost during index updates
  • Demonstrate awareness of Amazon's scale by addressing auto-scaling triggers based on load metrics
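The WAL-backed asynchronous index update from the points above can be sketched as follows. This is a single-process toy (an in-memory deque standing in for a durable log); `ShardWriter` and `GlobalIndexer` are illustrative names.

```python
from collections import deque

class ShardWriter:
    """Primary shard: every write lands in the WAL before the store,
    so the indexer can always catch up or rebuild from the log."""

    def __init__(self):
        self.store = {}      # primary shard data
        self.wal = deque()   # append-only log (durable storage in practice)
        self.acked = 0       # WAL offset already consumed by the indexer

class GlobalIndexer:
    """Background worker that drains WAL entries into the global index."""

    def __init__(self):
        self.index = {}

    def drain(self, writer: ShardWriter) -> None:
        # Replay everything past the last acked offset. Re-running after a
        # crash re-applies the same entries, so the drain is idempotent.
        for key, value in list(writer.wal)[writer.acked:]:
            self.index[key] = value
        writer.acked = len(writer.wal)

def put(writer: ShardWriter, key, value) -> None:
    writer.wal.append((key, value))   # 1. append to the WAL first
    writer.store[key] = value         # 2. then apply to the primary store
```

The write path never blocks on the index, which is the throughput argument in the first bullet; the cost is a window where the index lags the store (eventual consistency).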

Sample Answer

To design this system, I would first define constraints: sub-10ms reads and write throughput that scales horizontally without a fixed ceiling. I propose a three-layer architecture: a Metadata Service storing shard topology, a Shard Manager orchestrating physical nodes, and a Global Index Service maintaining searchable keys across all shards.

For sharding, I recommend consistent hashing with virtual nodes to ensure even data distribution and minimal rebalancing when new nodes join. When a shard reaches capacity, the system triggers an automated split: the Splitter process divides the key range, assigns the new ranges to fresh nodes, and streams incremental changes from the old shard to the new ones until they are fully synced.

Crucially, the Global Index is updated asynchronously. Insertions append to a local write-ahead log, and a background worker consumes the log to update the global inverted index. This keeps the primary database available while the index catches up.

If a shard fails, the Metadata Service redirects traffic to replicas, and the Indexer rebuilds any missing entries from the WAL. This approach aligns with Amazon's "Invent and Simplify" principle by automating complex scaling work behind a simple API.
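The automated split described above can be sketched as a single-process stand-in. `Shard` and `split_shard` are illustrative names, and the incremental catch-up stream (phase 2) is deliberately elided; in a real system the metadata flip happens only after replication lag reaches zero.

```python
class Shard:
    """A range-partitioned shard owning the half-open key range [lo, hi)."""

    def __init__(self, lo: int, hi: int):
        self.lo, self.hi = lo, hi
        self.data = {}

def split_shard(old: Shard, topology: dict) -> Shard:
    """Split a full shard at its range midpoint and register both halves
    in the topology map (the Metadata Service in the design above)."""
    mid = (old.lo + old.hi) // 2
    new = Shard(mid, old.hi)
    # Phase 1: bulk-copy the upper half while the old shard keeps serving.
    new.data = {k: v for k, v in old.data.items() if k >= mid}
    # Phase 2 (elided): stream incremental changes until lag is ~0.
    # Phase 3: atomically flip topology, then trim the old shard.
    old.hi = mid
    old.data = {k: v for k, v in old.data.items() if k < mid}
    topology[(old.lo, old.hi)] = old
    topology[(new.lo, new.hi)] = new
    return new
```

Keeping the topology update as the last, atomic step is what makes the operation safe to retry (idempotent): a crash before the flip leaves the old shard authoritative for the full range.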

Common Mistakes to Avoid

  • Focusing only on the database schema without designing the control plane for automation
  • Ignoring the performance cost of updating a global index synchronously with every write
  • Failing to address how to handle data migration without locking the entire cluster
  • Overlooking the scenario where a new shard node fails immediately after creation
