Design a Key-Value Store (Distributed Cache)

System Design
Hard
Amazon
146.7K views

Design a distributed Key-Value store (like Redis or Memcached). Discuss consistency models (CAP theorem), data partitioning using Consistent Hashing, and handling node failures.

Why Interviewers Ask This

Amazon interviewers ask this to evaluate your ability to design scalable, fault-tolerant systems under the CAP theorem constraints. They specifically test your grasp of distributed consistency, partitioning strategies like consistent hashing, and failure recovery mechanisms. This question reveals if you can balance trade-offs between availability and latency while adhering to Amazon's leadership principles of customer obsession and inventing simply.

How to Answer This Question

1. Clarify requirements immediately: Ask about read/write ratios, data size, latency SLAs, and consistency needs (strong vs. eventual) to set the scope. 2. Define the high-level architecture: Sketch a client-side routing layer, a cluster of nodes, and storage backends before diving deep. 3. Address data distribution: Explain consistent hashing in detail, discussing how it minimizes rehashing during node additions or removals compared to modulo hashing. 4. Discuss consistency models: Explicitly reference the CAP theorem; choose a model (e.g., eventual consistency for high availability) and justify it based on your initial assumptions. 5. Handle failures: Describe replication strategies (like ring topology with replicas), leader election, and how the system recovers from node crashes without data loss. 6. Conclude with trade-offs: Summarize why you made specific choices regarding consistency versus performance.

Key Points to Cover

  • Explicitly applying Consistent Hashing to minimize data redistribution during scaling
  • Justifying a choice between Strong and Eventual Consistency using the CAP theorem
  • Describing a concrete replication strategy for handling node failures and data durability
  • Demonstrating understanding of how to eliminate single points of failure in the architecture
  • Aligning technical decisions with Amazon's focus on scalability and customer experience

Sample Answer

To design a distributed key-value store like Redis at Amazon scale, I first clarify that we likely need high write throughput with eventual consistency for user session data. I propose an architecture where clients compute hash rings to route requests directly to specific nodes, eliminating a central bottleneck. For data distribution, I will use consistent hashing. Unlike simple modulo hashing, consistent hashing places keys on a virtual ring, ensuring that when a node fails or is added, only a small fraction of keys need to move to adjacent nodes, maintaining stability during scaling events. Regarding the CAP theorem, given Amazon's global footprint, I prioritize Availability and Partition Tolerance over strong consistency. We will implement asynchronous replication across multiple nodes within a region, allowing reads to proceed even if some nodes are unreachable. To handle node failures, each key is replicated to N neighbors on the ring. If a primary node fails, the system automatically promotes a replica to primary using a lightweight consensus protocol or vector clocks to resolve conflicts. This approach ensures the system remains operational during network partitions while keeping latency low for customers reading their session data.

Common Mistakes to Avoid

  • Focusing solely on single-node implementation without addressing distributed coordination challenges
  • Suggesting strong consistency for all operations without acknowledging the resulting latency penalties
  • Ignoring the mechanics of consistent hashing and proposing simple modulo-based partitioning
  • Overlooking how to detect and recover from node failures without causing data inconsistency

Practice This Question with AI

Answer this question orally or via text and get instant AI-powered feedback on your response quality, structure, and delivery.

Start Practicing

Related Interview Questions

Browse all 150 System Design questionsBrowse all 73 Amazon questions