Design a Distributed Queue (Kafka/SQS)

System Design
Medium
Amazon
23.9K views

Explain the architecture of a distributed message queue. Discuss consumers, producers, partitioning, offset management, and at-least-once vs. exactly-once delivery.

Why Interviewers Ask This

Interviewers at Amazon ask this to evaluate your ability to design scalable, fault-tolerant systems that handle high-throughput data streams. They specifically assess your understanding of trade-offs between consistency and availability, your grasp of partitioning strategies for parallelism, and your knowledge of delivery semantics like at-least-once versus exactly-once in real-world distributed environments.

How to Answer This Question

1. Clarify requirements: Ask about throughput, latency needs, retention policies, and whether ordering matters within partitions or globally. 2. Define core components: Briefly outline producers, brokers, consumers, and the underlying storage mechanism like log-structured merge trees. 3. Explain partitioning strategy: Describe how topics are split into partitions for parallelism and how keys determine message routing. 4. Discuss offset management: Explain how consumers track progress via committed offsets and what happens during rebalancing. 5. Address delivery guarantees: Compare at-least-once (idempotency required) vs. exactly-once (transactional writes), noting Amazon's preference for practical solutions over theoretical perfection.

Key Points to Cover

  • Explicitly mention partitioning as the mechanism for horizontal scalability and parallel consumption
  • Explain offset tracking as the foundation for consumer progress and fault tolerance
  • Distinguish clearly between at-least-once (practical) and exactly-once (complex) delivery models
  • Reference Amazon's focus on idempotency over complex transactional guarantees
  • Discuss rebalancing protocols and how they impact consumer group stability

Sample Answer

To design a distributed queue like Kafka or SQS, I start by clarifying requirements. For Amazon-scale workloads, we need high throughput, low latency, and durability. The system consists of producers sending messages to topics, which are split into partitions across multiple broker nodes. Each partition is an ordered, immutable log where messages are appended sequentially. Producers use a key-based hashing algorithm to route messages to specific partitions, ensuring order for related events while allowing parallel processing across partitions. Consumers subscribe to partitions and maintain offsets, which represent their read position. If a consumer fails, another takes over its partitions using a rebalancing protocol. Regarding delivery, at-least-once is standard; we ensure idempotency on the consumer side to handle duplicates from retries. Exactly-once requires transactional semantics with two-phase commits, adding complexity and latency. For most Amazon services, at-least-once with idempotent operations provides the best balance of performance and reliability. We also implement dead-letter queues for failed messages and monitor lag metrics to prevent backpressure issues.

Common Mistakes to Avoid

  • Confusing topic-level ordering with partition-level ordering, leading to incorrect assumptions about global sequence
  • Ignoring the role of replication factors when discussing durability and fault tolerance
  • Overcomplicating exactly-once semantics without acknowledging the operational overhead it introduces
  • Failing to address what happens during consumer failures or cluster node outages

Practice This Question with AI

Answer this question orally or via text and get instant AI-powered feedback on your response quality, structure, and delivery.

Start Practicing

Related Interview Questions

Browse all 150 System Design questionsBrowse all 73 Amazon questions