Design a System for Data Sharding and Indexing

System Design
Hard
Amazon

Focus solely on the data layer. Design an automated system that handles creating new database shards and updating the global index transparently as data grows.

Why Interviewers Ask This

Interviewers at Amazon ask this to evaluate your ability to design scalable, automated data architectures that handle massive growth without human intervention. They specifically assess your understanding of consistency models, failure handling during shard rebalancing, and the trade-offs between indexing latency and throughput in a distributed environment.

How to Answer This Question

1. Clarify requirements by defining scale, read/write ratios, and acceptable latency; note that Amazon's customer-obsession focus implies high availability.
2. Propose a logical architecture that separates the metadata service, shard manager, and global index layer before drawing diagrams.
3. Detail the sharding strategy: explain how you determine key distribution and handle hotspots using consistent hashing or range-based partitioning.
4. Describe the automation workflow for adding shards, emphasizing idempotent operations and zero-downtime data migration.
5. Explain the global index update mechanism: discuss eventual versus strong consistency trade-offs and how you recover from index corruption.
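Step 3 can be sketched with a minimal consistent-hash ring using virtual nodes. This is an illustrative stand-in, not a production router; the class and method names are assumptions.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Stable hash for ring placement (md5 used only for illustration)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring: each physical node owns many virtual points,
    so adding a node remaps only a small fraction of keys."""

    def __init__(self, vnodes: int = 100):
        self.vnodes = vnodes
        self._hashes = []   # sorted virtual-node hashes
        self._nodes = []    # parallel list: owning physical node

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            h = _hash(f"{node}#{i}")
            idx = bisect.bisect(self._hashes, h)
            self._hashes.insert(idx, h)
            self._nodes.insert(idx, node)

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._hashes)
        return self._nodes[idx]
```

Because keys only ever move *to* a newly added node, expansion touches roughly `1/N` of the data rather than reshuffling everything, which is what makes the automated shard-add workflow in step 4 tractable.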

Key Points to Cover

  • Explicitly mention eventual consistency for the global index to maintain high write throughput
  • Describe a specific strategy like Consistent Hashing to prevent data skew during expansion
  • Explain how the system handles partial failures during the shard splitting process
  • Detail the use of Write-Ahead Logs (WAL) to ensure no data is lost during index updates
  • Demonstrate awareness of Amazon's scale by addressing auto-scaling triggers based on load metrics
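The WAL-backed asynchronous index update from the points above can be sketched as follows. This is a single-process toy (an in-memory deque standing in for a durable log); `ShardWriter` and `GlobalIndexer` are illustrative names.

```python
from collections import deque

class ShardWriter:
    """Primary shard: every write lands in the WAL before the store,
    so the indexer can always catch up or rebuild from the log."""

    def __init__(self):
        self.store = {}      # primary shard data
        self.wal = deque()   # append-only log (durable storage in practice)
        self.acked = 0       # WAL offset already consumed by the indexer

class GlobalIndexer:
    """Background worker that drains WAL entries into the global index."""

    def __init__(self):
        self.index = {}

    def drain(self, writer: ShardWriter) -> None:
        # Replay everything past the last acked offset. Re-running after a
        # crash re-applies the same entries, so the drain is idempotent.
        for key, value in list(writer.wal)[writer.acked:]:
            self.index[key] = value
        writer.acked = len(writer.wal)

def put(writer: ShardWriter, key, value) -> None:
    writer.wal.append((key, value))   # 1. append to the WAL first
    writer.store[key] = value         # 2. then apply to the primary store
```

The write path never blocks on the index, which is the throughput argument in the first bullet; the cost is a window where the index lags the store (eventual consistency).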

Sample Answer

To design this system, I would first define constraints: sub-10ms reads and write throughput that scales horizontally without a fixed ceiling. I propose a three-layer architecture: a Metadata Service storing shard topology, a Shard Manager orchestrating physical nodes, and a Global Index Service maintaining searchable keys across all shards.

For sharding, I recommend consistent hashing with virtual nodes to ensure even data distribution and minimal rebalancing when new nodes join. When a shard reaches capacity, the system triggers an automated split: the Splitter process divides the key range, assigns the new ranges to fresh nodes, and streams incremental changes from the old shard to the new ones until they are fully synced.

Crucially, the Global Index is updated asynchronously. Insertions append to a local write-ahead log, and a background worker consumes the log to update the global inverted index. This keeps the primary database available while the index catches up.

If a shard fails, the Metadata Service redirects traffic to replicas, and the Indexer rebuilds any missing entries from the WAL. This approach aligns with Amazon's "Invent and Simplify" principle by automating complex scaling work behind a simple API.
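The automated split described above can be sketched as a single-process stand-in. `Shard` and `split_shard` are illustrative names, and the incremental catch-up stream (phase 2) is deliberately elided; in a real system the metadata flip happens only after replication lag reaches zero.

```python
class Shard:
    """A range-partitioned shard owning the half-open key range [lo, hi)."""

    def __init__(self, lo: int, hi: int):
        self.lo, self.hi = lo, hi
        self.data = {}

def split_shard(old: Shard, topology: dict) -> Shard:
    """Split a full shard at its range midpoint and register both halves
    in the topology map (the Metadata Service in the design above)."""
    mid = (old.lo + old.hi) // 2
    new = Shard(mid, old.hi)
    # Phase 1: bulk-copy the upper half while the old shard keeps serving.
    new.data = {k: v for k, v in old.data.items() if k >= mid}
    # Phase 2 (elided): stream incremental changes until lag is ~0.
    # Phase 3: atomically flip topology, then trim the old shard.
    old.hi = mid
    old.data = {k: v for k, v in old.data.items() if k < mid}
    topology[(old.lo, old.hi)] = old
    topology[(new.lo, new.hi)] = new
    return new
```

Keeping the topology update as the last, atomic step is what makes the operation safe to retry (idempotent): a crash before the flip leaves the old shard authoritative for the full range.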

Common Mistakes to Avoid

  • Focusing only on the database schema without designing the control plane for automation
  • Ignoring the performance cost of updating a global index synchronously with every write
  • Failing to address how to handle data migration without locking the entire cluster
  • Overlooking the scenario where a new shard node fails immediately after creation
