Design a Distributed File Locking System

System Design
Hard
Google

Design a system that provides exclusive read/write access to shared files across a distributed cluster. Discuss using ZooKeeper or a dedicated lock service.

Why Interviewers Ask This

Interviewers ask this to evaluate your ability to handle distributed consistency, race conditions, and fault tolerance in high-concurrency environments. They specifically want to see if you understand the CAP theorem trade-offs when designing coordination services like ZooKeeper. The question tests your capacity to translate theoretical consensus algorithms into a practical, production-grade locking mechanism that prevents data corruption across a cluster.

How to Answer This Question

1. Clarify requirements immediately: define if locks are exclusive or shared, synchronous vs asynchronous, and the expected failure modes like network partitions or node crashes.
2. Propose a centralized coordinator using an external service like ZooKeeper or etcd rather than building a custom consensus layer from scratch, as this aligns with Google's preference for leveraging mature infrastructure.
3. Detail the lock acquisition flow: explain how clients create ephemeral sequential nodes to request access and how the system determines the winner based on sequence numbers.
4. Discuss release mechanisms: describe how ephemeral nodes automatically vanish on client death to prevent deadlocks, ensuring liveness.
5. Address edge cases explicitly: cover split-brain scenarios, leader election failures, and how to handle clock skew or network delays during lock contention.
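The acquisition and release flow in steps 3 and 4 can be sketched with a small in-memory model. This is only a simulation under stated assumptions, not the ZooKeeper or etcd client API: the `LockDirectory` class, its method names, and the session ids are hypothetical and exist purely to show how sequence numbers decide ownership and how session expiry frees the lock.

```python
import itertools


class LockDirectory:
    """In-memory stand-in for one lock path such as /locks/fileA (hypothetical)."""

    def __init__(self):
        self._counter = itertools.count()
        self.nodes = {}  # sequence number -> owning session id

    def create_ephemeral_sequential(self, session_id):
        # The coordinator, not the client, assigns the increasing sequence number.
        seq = next(self._counter)
        self.nodes[seq] = session_id
        return seq

    def holder(self):
        # The lowest live sequence number owns the lock.
        return min(self.nodes) if self.nodes else None

    def predecessor(self, seq):
        # The node a waiting client should watch: the closest lower sequence number.
        lower = [s for s in self.nodes if s < seq]
        return max(lower) if lower else None

    def expire_session(self, session_id):
        # Ephemeral nodes are removed when their owning session ends.
        for seq in [s for s, owner in self.nodes.items() if owner == session_id]:
            del self.nodes[seq]


lock_dir = LockDirectory()
a = lock_dir.create_ephemeral_sequential("client-a")
b = lock_dir.create_ephemeral_sequential("client-b")
c = lock_dir.create_ephemeral_sequential("client-c")

print(lock_dir.holder() == a)        # client-a created the lowest node
lock_dir.expire_session("client-a")  # simulate a crash of the lock holder
print(lock_dir.holder() == b)        # the lock passes to the next node in line
```

In a real deployment the coordinator's replicated state plays the role of `nodes`, and the expiry call corresponds to the service timing out a session that has stopped sending heartbeats.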

Key Points to Cover

  • Explicitly choosing a proven coordination service like ZooKeeper over building a custom consensus layer
  • Using ephemeral sequential nodes to handle automatic lock release on client failure
  • Explaining the logic of comparing sequence numbers to determine lock ownership fairly
  • Addressing the thundering herd problem through targeted notifications rather than broadcasts
  • Demonstrating understanding of CAP theorem trade-offs regarding availability versus strict consistency

Sample Answer

To design a distributed file locking system, I would first clarify that we need exclusive write locks and potentially shared read locks, with a strong guarantee of safety even under network partitions. Given the scale at Google, I wouldn't build a custom consensus algorithm but would instead leverage an existing coordination service like ZooKeeper or etcd.

The core strategy treats each lock request as an ephemeral, sequential znode under a per-file path, say /locks/fileA. When a client wants to lock a file, it creates its node, and the coordinator appends a unique, monotonically increasing sequence number. The client then lists the children of that path: if its own node has the lowest sequence number, it holds the lock; otherwise it sets a watch on the node immediately preceding its own and waits. Granting the lock in sequence order ensures fairness and prevents starvation.

For fault tolerance, the ephemeral nature of these nodes is critical. If a client holding the lock crashes or loses connectivity, its session expires once the coordinator has missed heartbeats for the session timeout, and the node is automatically deleted. The lock is therefore released without the application having to decide on its own whether a lock is stale, a check that is prone to false positives.

However, we must consider the 'thundering herd' problem, where many clients watch the same node and all wake up when it is deleted. To mitigate this, each waiting client watches only its immediate predecessor in the sequence chain, so a release notifies exactly one client instead of broadcasting to everyone.

Finally, we handle read-write conflicts by distinguishing read requests from write requests, for example with separate node-name prefixes under the same lock path. A reader may proceed as soon as no write request precedes it, while a writer must wait until its node has the lowest sequence number. This lets multiple readers proceed concurrently while writers remain mutually exclusive.
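The read/write rule at the end of the answer can be made concrete with a short sketch. It assumes a queue represented as a plain dictionary from sequence number to request kind; the function name `can_acquire` and this representation are illustrative assumptions, not part of any coordination service's API.

```python
def can_acquire(nodes, my_seq):
    """Decide whether the request at my_seq may proceed.

    nodes maps sequence number -> "read" or "write" (hypothetical layout).
    """
    kind = nodes[my_seq]
    earlier = [k for seq, k in nodes.items() if seq < my_seq]
    if kind == "write":
        # A writer needs every earlier request, read or write, to be gone.
        return not earlier
    # A reader only has to wait for earlier writers.
    return "write" not in earlier


queue = {0: "read", 1: "read", 2: "write", 3: "read"}

for seq in sorted(queue):
    print(seq, queue[seq], can_acquire(queue, seq))
```

With this queue, the two leading readers proceed together, the writer at sequence 2 waits for both of them, and the reader at sequence 3 waits behind the writer, which keeps a steady stream of readers from starving a queued writer.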

Common Mistakes to Avoid

  • Suggesting a naive timeout-based approach where clients check if a lock is stale, which fails during network partitions
  • Ignoring the difference between exclusive and shared locks, leading to potential data corruption scenarios
  • Failing to mention ephemeral nodes, resulting in deadlocks if the client process crashes unexpectedly
  • Overlooking the performance cost of having every waiting client watch the lock holder instead of having each client watch only its immediate predecessor
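The cost named in the last mistake can be estimated with a back-of-the-envelope model. The sketch below is a simplification under stated assumptions: it counts only how many waiting clients are woken each time the lock is released, ignores network and re-registration costs, and uses hypothetical names (`total_wakeups`, `watch_holder`, `watch_predecessor`).

```python
def total_wakeups(num_waiters, strategy):
    """Count client wake-ups until a queue of num_waiters has fully drained."""
    total = 0
    waiting = num_waiters
    while waiting > 0:
        if strategy == "watch_holder":
            # Every waiter watches the holder, so each release wakes all of them.
            total += waiting
        elif strategy == "watch_predecessor":
            # Only the next client in line is watching the released node.
            total += 1
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        waiting -= 1  # one woken client takes the lock; the rest keep waiting
    return total


print(total_wakeups(100, "watch_holder"))       # 5050
print(total_wakeups(100, "watch_predecessor"))  # 100
```

Under this model the number of wake-ups grows quadratically with queue length when everyone watches the holder, and only linearly when each client watches its predecessor.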
