Explain the Raft Consensus Algorithm

System Design
Hard
Meta

Explain the Raft consensus algorithm (or Paxos). Focus on the three major subproblems: Leader Election, Log Replication, and Safety/Consistency. Discuss its role in systems like etcd.

Why Interviewers Ask This

Interviewers at Meta ask this to evaluate your grasp of distributed system reliability and your ability to reason about replicated state machines. They specifically test whether you understand how to maintain consistency despite network partitions and node failures, which is critical for infrastructure such as F5 or internal data stores.

How to Answer This Question

1. Start with a high-level definition: Raft keeps a replicated log consistent by electing a single leader that manages log replication.
2. Break down the three core subproblems explicitly: Leader Election, Log Replication, and Safety/Consistency.
3. For Leader Election, explain the three roles (Follower, Candidate, Leader), terms, and the randomized election timeout.
4. Detail Log Replication: clients send requests to the leader, which appends each entry to its own log and replicates it to followers; the entry is committed once a majority has stored it.
5. Conclude with the Safety properties: committed entries are never lost, and a leader commits entries from earlier terms only indirectly, by committing an entry from its own term.

Avoid getting bogged down in mathematical proofs; focus on the operational flow and failure scenarios relevant to systems like etcd.
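The role transitions in step 3 can be illustrated with a small sketch. This is a toy model under stated assumptions, not etcd's implementation: the `Node` class and its method names are hypothetical, there is no networking or log, and the 150 to 300 ms timeout range is only an example value.

```python
import random
from enum import Enum


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


class Node:
    """Toy model of one server's election state (hypothetical, no networking)."""

    def __init__(self, node_id, cluster_size):
        self.node_id = node_id
        self.cluster_size = cluster_size
        self.role = Role.FOLLOWER
        self.current_term = 0
        self.voted_for = None
        self.votes_received = set()

    def new_election_timeout(self, low_ms=150, high_ms=300):
        # Randomized timeouts make simultaneous candidacies (split votes) unlikely.
        return random.uniform(low_ms, high_ms)

    def on_election_timeout(self):
        # No heartbeat arrived in time: start an election for a new term.
        if self.role is Role.LEADER:
            return
        self.role = Role.CANDIDATE
        self.current_term += 1
        self.voted_for = self.node_id
        self.votes_received = {self.node_id}
        self._maybe_become_leader()

    def on_vote_granted(self, voter_id, term):
        # Ignore votes that belong to an older election.
        if self.role is not Role.CANDIDATE or term != self.current_term:
            return
        self.votes_received.add(voter_id)
        self._maybe_become_leader()

    def _maybe_become_leader(self):
        # A strict majority of the cluster is required to win.
        if len(self.votes_received) > self.cluster_size // 2:
            self.role = Role.LEADER

    def on_message_term(self, term):
        # Any message carrying a higher term forces a step-down to Follower.
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.role = Role.FOLLOWER
```

In a three-node cluster, a Candidate needs only one vote besides its own, and a leader that later sees a higher term immediately reverts to Follower.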

Key Points to Cover

  • Explicitly identifying the three subproblems: Leader Election, Log Replication, and Safety
  • Explaining the term-based logic and majority voting requirements for elections
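  • Clarifying that only the Leader accepts client requests and creates new log entries; Followers append only what the Leader sends them
  • Defining 'committed' as replicated to a majority of nodes, with the caveat that a Leader counts replicas only for entries from its own term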
  • Connecting the theory to real-world usage in systems like etcd
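To make the definition of 'committed' concrete, here is a minimal sketch of how a leader might advance its commit index. The function name and argument layout are assumptions for illustration; `match_index`, `commit_index`, and `current_term` follow the naming used in the Raft paper, and log indices are 1-based.

```python
def advance_commit_index(log_terms, match_index, commit_index, current_term):
    """Return the leader's new commit index (hypothetical helper).

    log_terms[i - 1] is the term of log entry i (1-based indices).
    match_index lists, for every server including the leader, the highest
    entry index known to be stored on that server.
    """
    majority = len(match_index) // 2 + 1
    # The majority-th largest value is stored on at least `majority` servers.
    candidate = sorted(match_index, reverse=True)[majority - 1]
    # Only an entry from the leader's current term is committed by counting
    # replicas; earlier entries become committed along with it.
    if candidate > commit_index and log_terms[candidate - 1] == current_term:
        return candidate
    return commit_index
```

With five servers, an entry stored on three of them is a commit candidate, but the index advances only if that entry was created in the leader's current term.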

Sample Answer

Raft is a consensus algorithm designed to be easier to understand than Paxos while still providing strong consistency for a replicated log. It decomposes the problem into three parts: Leader Election, Log Replication, and Safety.

First, Leader Election. Every node starts as a Follower. If a Follower hears nothing from a leader within its randomized election timeout, it becomes a Candidate, increments its term, votes for itself, and requests votes from the other nodes. If it receives votes from a majority, it becomes the Leader. Because each node grants at most one vote per term, at most one Leader can be elected in any term, which rules out split-brain.

Second, Log Replication. All client requests go through the Leader. The Leader appends the entry to its own log and sends AppendEntries RPCs to the Followers. Once a majority of nodes has stored the entry, the Leader marks it committed and applies it to its state machine; Followers apply it once they learn the new commit index.

Third, Safety. Raft guarantees that once an entry is committed, it appears in the log of every future Leader, so it is never overwritten. Two rules enforce this. A node grants its vote only to a Candidate whose log is at least as up-to-date as its own, so a node missing committed entries cannot win an election. And a Leader never commits an entry from a previous term by counting replicas; such entries become committed only indirectly, when an entry from the Leader's current term is committed.

etcd, for example, uses Raft to replicate critical configuration data. Understanding this separation of concerns helps engineers design resilient services that survive network partitions without losing data integrity.
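The voting side of the answer can be sketched as a RequestVote handler. This is an illustrative model, assuming a hypothetical `Voter` class that stores only the term of each log entry; the argument names mirror the RequestVote RPC in the Raft paper. It shows the two checks a node makes before granting a vote: one vote per term, and the candidate's log must be at least as up-to-date as its own.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Voter:
    """Hypothetical voter state; log_terms[i - 1] is the term of entry i."""

    current_term: int = 0
    voted_for: Optional[str] = None
    log_terms: List[int] = field(default_factory=list)

    def handle_request_vote(self, term, candidate_id, last_log_index, last_log_term):
        if term < self.current_term:
            return False  # stale candidate from an older term
        if term > self.current_term:
            # A newer term resets this node's vote.
            self.current_term = term
            self.voted_for = None
        my_last_index = len(self.log_terms)
        my_last_term = self.log_terms[-1] if self.log_terms else 0
        # "At least as up-to-date": compare last term first, then log length.
        up_to_date = (last_log_term, last_log_index) >= (my_last_term, my_last_index)
        if self.voted_for in (None, candidate_id) and up_to_date:
            self.voted_for = candidate_id
            return True
        return False
```

A candidate with a shorter log in the same last term is refused, which is how a node that is missing committed entries is kept from becoming leader.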

Common Mistakes to Avoid

  • Confusing Raft with Paxos by ignoring the distinct leader-centric approach
  • Failing to mention that uncommitted entries can be overwritten during leadership changes
  • Omitting the concept of 'terms', which prevent stale leaders from making decisions
  • Describing log replication without explaining how safety is maintained across node failures
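The point about uncommitted entries being overwritten is easier to explain with a sketch of the follower side of AppendEntries. This is a simplified model, assuming a hypothetical `append_entries` function that represents the log as a list of `(term, command)` tuples with 1-based indices; term checks and commit-index updates are deliberately left out.

```python
def append_entries(log, prev_log_index, prev_log_term, entries):
    """Follower-side log handling (hypothetical helper).

    Returns (success, new_log). The input log is not modified.
    """
    # Consistency check: the follower must already hold the entry that
    # immediately precedes the new ones, with a matching term.
    if prev_log_index > len(log):
        return False, log
    if prev_log_index > 0 and log[prev_log_index - 1][0] != prev_log_term:
        return False, log

    new_log = list(log)
    for offset, entry in enumerate(entries):
        index = prev_log_index + 1 + offset  # 1-based position of this entry
        if index <= len(new_log):
            if new_log[index - 1][0] != entry[0]:
                # Conflict: drop this entry and everything after it,
                # then take the leader's version.
                del new_log[index - 1:]
                new_log.append(entry)
            # Otherwise the entry is already present; keep it as is.
        else:
            new_log.append(entry)
    return True, new_log
```

For example, a follower holding an uncommitted term-2 entry at index 3 replaces it when a new leader sends a term-3 entry for the same index, while a re-delivered entry that already matches leaves the log untouched.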
