Design a Distributed Semaphore

System Design
Hard
Stripe
33.5K views

Design a system that acts as a distributed lock or semaphore to control access to a shared resource across multiple servers. Discuss using ZooKeeper or Redis.

Why Interviewers Ask This

Interviewers at Stripe ask this to evaluate your ability to handle distributed consensus and race conditions in high-throughput financial systems. They specifically test your understanding of eventual consistency, leader election, and failure scenarios like network partitions or node crashes. The goal is to see if you can design a solution that guarantees mutual exclusion without creating a single point of failure.

How to Answer This Question

1. Clarify Requirements: Define the semaphore's capacity (e.g., allow N concurrent connections), latency constraints typical for payment processing, and durability needs. 2. Choose a Consensus Model: Decide between using Redis with Lua scripts for speed versus ZooKeeper for strong consistency and watch mechanisms. 3. Design the Core Logic: Explain how clients acquire a lock by creating an ephemeral node or incrementing a counter, and release it upon timeout or explicit unlock. 4. Handle Failures: Detail strategies for handling client crashes where locks must be auto-released via TTLs or heartbeat mechanisms to prevent deadlocks. 5. Discuss Trade-offs: Compare the CAP theorem implications of your choice, emphasizing availability for Stripe's global scale versus strict consistency for transaction integrity.

Key Points to Cover

  • Explicitly defining the difference between a lock (binary) and a semaphore (count-based) before starting
  • Demonstrating knowledge of ephemeral nodes and sequential ordering in ZooKeeper
  • Addressing the 'livelock' scenario where a crashed client holds a lock forever
  • Discussing the trade-off between strong consistency (ZooKeeper) and high throughput (Redis)
  • Explaining how to handle network partitions and split-brain scenarios effectively

Sample Answer

To design a distributed semaphore for a system like Stripe, I would first clarify the requirements. We need to limit access to a shared resource, such as a database connection pool or a specific API endpoint, across thousands of servers. The critical constraint is preventing race conditions while maintaining low latency. I propose using Apache ZooKeeper for its strong consistency guarantees, which are vital for financial transactions. Each server attempting to access the resource creates an ephemeral sequential node under a parent directory. The nodes are sorted by their creation order. A server checks if its node index is within the semaphore count; if so, it acquires the lock. If not, it installs a watcher on the node immediately preceding it. When the previous holder releases the lock or crashes, the watcher triggers, allowing the next server to proceed. Alternatively, for lower latency scenarios where eventual consistency is acceptable, I might use Redis with Lua scripts to atomically decrement a counter. This approach is faster but requires careful handling of network partitions. In both designs, we must implement a lease mechanism with a Time-To-Live (TTL) to ensure that if a server crashes without explicitly releasing the lock, the resource becomes available again after the TTL expires. This balances safety with the high availability required in modern distributed systems.

Common Mistakes to Avoid

  • Ignoring the need for automatic lock expiration, leading to permanent deadlocks if a client crashes
  • Focusing solely on the happy path and failing to discuss what happens during a network partition
  • Proposing a central coordinator that acts as a single point of failure without a failover strategy
  • Confusing mutual exclusion locks with semaphores by forgetting to implement the counting logic
  • Overlooking the performance cost of polling instead of using efficient watch/notification mechanisms

Practice This Question with AI

Answer this question orally or via text and get instant AI-powered feedback on your response quality, structure, and delivery.

Start Practicing

Related Interview Questions

Browse all 150 System Design questionsBrowse all 57 Stripe questions