Design a Distributed Queue (Kafka/SQS)

Question

Accepted Answer

To design a distributed queue like Kafka or SQS, I start by clarifying requirements. For Amazon-scale workloads, we need high throughput, low latency, and durability. The system consists of producers sending messages to topics, which are split into partitions spread across multiple broker nodes. Each partition is an ordered, immutable log to which messages are appended sequentially. Producers hash the message key to choose a partition, so all events with the same key land in the same partition and keep their order, while different partitions can be processed in parallel.

Consumers in a group subscribe to a topic, and each is assigned a set of partitions. For every assigned partition a consumer tracks an offset, which is its read position in that log. If a consumer fails, a rebalancing protocol reassigns its partitions to another consumer in the group, which resumes from the last committed offset.

Regarding delivery, at-least-once is the standard guarantee: a message is acknowledged only after it has been processed, so a retry can deliver it twice, and we make the consumer idempotent to absorb those duplicates. Exactly-once requires transactional semantics in the style of a two-phase commit, so that output and offsets are committed atomically, which adds complexity and latency. For most Amazon services, at-least-once with idempotent operations provides the best balance of performance and reliability. We also route messages that keep failing to a dead-letter queue so they do not block the partition, and we monitor consumer lag (the gap between the newest offset and the committed offset) to catch consumers that are falling behind producers.
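The routing, offset, and retry behavior described in the answer can be sketched in a few lines. This is a single-process, in-memory illustration only, not how Kafka or SQS is implemented: the names `partition_for`, `Topic`, and `Consumer` are hypothetical, MD5 is used as the hash only because it ships with the standard library, and replication, networking, and rebalancing are left out.

```python
import hashlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (hash choice is an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


class Topic:
    """A topic modeled as a list of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, msg_id, value):
        p = partition_for(key, len(self.partitions))
        self.partitions[p].append((msg_id, key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)


class Consumer:
    """At-least-once consumer: the offset advances only after handling."""

    def __init__(self, topic, handler, max_attempts=3):
        self.topic = topic
        self.handler = handler
        self.max_attempts = max_attempts
        self.offsets = defaultdict(int)   # partition -> next offset to read
        self.attempts = defaultdict(int)  # msg_id -> failed attempts so far
        self.processed_ids = set()        # dedup store that makes handling idempotent
        self.dead_letters = []            # messages that exhausted their retries

    def poll(self, partition):
        """Process one message; return False once the partition is drained."""
        log = self.topic.partitions[partition]
        offset = self.offsets[partition]
        if offset >= len(log):
            return False
        msg_id, key, value = log[offset]
        if msg_id not in self.processed_ids:  # skip duplicates from retries
            try:
                self.handler(key, value)
                self.processed_ids.add(msg_id)
            except Exception:
                self.attempts[msg_id] += 1
                if self.attempts[msg_id] < self.max_attempts:
                    return True  # offset not committed, so the message is redelivered
                self.dead_letters.append((msg_id, key, value))
        self.offsets[partition] = offset + 1  # commit the offset
        return True

    def lag(self, partition):
        """Messages written to the partition but not yet committed by this consumer."""
        return len(self.topic.partitions[partition]) - self.offsets[partition]
```

A caller would produce with a business key such as a user or order ID, then call `poll` in a loop for each assigned partition. Because the offset is committed only after the handler returns, a crash between handling and commit causes a redelivery, which is why the dedup set (in practice a durable store keyed by message ID) matters.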

