Design a Notification Service (Push/SMS/Email)

System Design
Medium
Meta

Design a generalized system for sending millions of notifications daily. Discuss queuing (Kafka/RabbitMQ), delivery guarantees, and handling failure modes.

Why Interviewers Ask This

Interviewers at Meta ask this to evaluate your ability to design scalable, fault-tolerant distributed systems under high load. They specifically assess how you balance throughput with delivery guarantees when handling millions of daily events. The question tests your understanding of asynchronous processing, message broker selection, and strategies for managing partial failures without blocking the entire system.

How to Answer This Question

1. Clarify requirements immediately: Define scale (requests per second), latency tolerance, and the specific delivery guarantees (at-most-once vs. at-least-once) required by different channels, such as push versus email.
2. Outline the high-level architecture: Propose a flow where client requests hit an API gateway, which enqueues messages into a robust broker like Kafka rather than sending directly.
3. Detail the consumer logic: Explain how worker groups consume from topics, transform data for specific providers (e.g., Firebase for push, Twilio for SMS), and handle retries with exponential backoff.
4. Address failure modes: Discuss idempotency keys to prevent duplicate sends during network glitches and dead-letter queues for unresolvable errors.
5. Optimize for cost and reliability: Mention sharding strategies for partitioning traffic and monitoring metrics like lag and error rates to ensure SLA compliance.
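
Steps 3 and 4 can be sketched in a few lines. The following is a minimal illustration, not a production consumer: `TransientError`, `PermanentError`, `backoff_delay`, and `deliver_with_retry` are hypothetical names, the `send` callable stands in for a provider client, and the dead-letter queue is a plain list. It uses "full jitter" backoff, which is one common variant of exponential backoff.

```python
import random
import time


class TransientError(Exception):
    """Provider failure worth retrying (e.g. a timeout or HTTP 503)."""


class PermanentError(Exception):
    """Provider failure that retrying cannot fix (e.g. an invalid device token)."""


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def deliver_with_retry(message, send, dead_letters, max_attempts=5, sleep=time.sleep):
    """Try to send; retry transient failures, dead-letter everything else.

    `send` is an assumed provider call that raises TransientError or
    PermanentError. `dead_letters` is any list-like sink (a DLQ stand-in).
    Returns True if the message was delivered.
    """
    for attempt in range(max_attempts):
        try:
            send(message)
            return True
        except PermanentError as exc:
            # Retrying cannot help, so park the message immediately.
            dead_letters.append((message, str(exc)))
            return False
        except TransientError as exc:
            if attempt == max_attempts - 1:
                # Retries exhausted: park the message for manual inspection.
                dead_letters.append((message, str(exc)))
                return False
            sleep(backoff_delay(attempt))
    return False
```

In an interview it is worth saying why the jitter is there: without it, many workers that failed at the same moment retry at the same moment, which can keep a recovering provider overloaded.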

Key Points to Cover

  • Explicitly defining the trade-off between latency and delivery guarantees (at-most-once vs at-least-once)
  • Using a message broker like Kafka to decouple producers from consumers for scalability
  • Implementing idempotency keys to prevent duplicate notifications during retries
  • Designing a Dead Letter Queue (DLQ) pattern for handling permanent failures
  • Addressing backpressure and auto-scaling mechanisms to handle traffic spikes
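
To make the idempotency point concrete, here is a small sketch of consumer-side deduplication. It assumes each notification carries an `idempotency_key` field set by the producer; `IdempotencyStore` and `send_once` are hypothetical names, and the in-memory dict stands in for whatever shared store (for example a key-value cache with expiry) a real deployment would use.

```python
import time


class IdempotencyStore:
    """In-memory stand-in for a shared store with per-key expiry (an assumption)."""

    def __init__(self, ttl_seconds=86400, clock=time.monotonic):
        self._expiry_by_key = {}
        self._ttl = ttl_seconds
        self._clock = clock

    def claim(self, key):
        """Return True if the caller is the first to claim `key` within the TTL."""
        now = self._clock()
        expires_at = self._expiry_by_key.get(key)
        if expires_at is not None and expires_at > now:
            return False
        self._expiry_by_key[key] = now + self._ttl
        return True

    def release(self, key):
        """Forget a claim so that a later retry is allowed to send."""
        self._expiry_by_key.pop(key, None)


def send_once(store, notification, send):
    """Send unless this idempotency key was already handled."""
    key = notification["idempotency_key"]
    if not store.claim(key):
        return "duplicate"
    try:
        send(notification)
    except Exception:
        # The send failed, so release the claim and let the retry path try again.
        store.release(key)
        raise
    return "sent"
```

A caveat worth stating aloud: if the worker crashes after the provider accepted the message but before the claim is durable, a duplicate is still possible, so this narrows the window rather than giving exactly-once delivery.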

Sample Answer

To design a notification service that handles millions of daily events, I would decouple ingestion from delivery using Apache Kafka. First, we define the scope: if critical alerts need strict delivery guarantees while marketing emails can tolerate some delay, we should give them separate topics, or at least separate partitions, so that bulk traffic cannot delay urgent messages.

The API layer accepts requests and publishes them to Kafka, attaching a unique idempotency key to each message. Consumers check that key before sending, so a retry caused by a transient network issue does not produce a duplicate notification.

For the delivery layer, we deploy a cluster of stateless consumers that poll these topics. Each consumer is responsible for a specific channel type, such as Push or Email. If a provider like Firebase returns a 503 error, the consumer applies exponential backoff before re-queuing the message. Crucially, we route messages that still fail after the maximum number of retries to a Dead Letter Queue (DLQ), which allows manual inspection without halting the pipeline.

To handle peak loads, we use Kafka's partitioning to parallelize consumption across hundreds of workers. Finally, we monitor key metrics like consumer lag and DLQ depth to trigger auto-scaling policies, so the system stays responsive even during the viral events typical in Meta's ecosystem.
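
The lag-driven auto-scaling mentioned in the answer can be reduced to a short calculation. This is a rough sketch under stated assumptions: `desired_workers` is a hypothetical helper, and the per-worker throughput and drain target are numbers you would measure or choose yourself. The cap reflects how Kafka consumer groups work: each partition is read by at most one consumer in a group, so workers beyond the partition count would sit idle.

```python
import math


def desired_workers(total_lag, per_worker_rate, target_drain_seconds,
                    partitions, min_workers=1):
    """Estimate how many consumers are needed to drain the current lag.

    total_lag: unconsumed messages summed across partitions.
    per_worker_rate: messages one worker delivers per second (measured).
    target_drain_seconds: how quickly the backlog should be cleared.
    partitions: partition count of the topic, which caps useful parallelism.
    """
    if per_worker_rate <= 0 or target_drain_seconds <= 0:
        raise ValueError("rate and drain target must be positive")
    needed = math.ceil(total_lag / (per_worker_rate * target_drain_seconds))
    return max(min_workers, min(partitions, needed))


# Illustrative numbers only: 120,000 messages behind, 200 msg/s per worker,
# backlog to be cleared within 60 seconds, topic has 64 partitions.
print(desired_workers(120_000, 200, 60, partitions=64))
```

If the result keeps hitting the partition cap, that is a signal to add partitions or split traffic across more topics rather than to add workers.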

Common Mistakes to Avoid

  • Suggesting synchronous HTTP calls to the provider for every notification, which ties the caller's request threads to provider latency and creates bottlenecks
  • Ignoring idempotency, leading to users receiving multiple copies of the same alert during network retries
  • Failing to distinguish between different notification types, treating all messages with identical priority and storage needs
  • Overlooking the need for a Dead Letter Queue, leaving failed messages stuck in the main processing loop
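
One way to avoid treating all notifications identically is to route them by channel and priority at publish time. The sketch below is illustrative only: the topic naming scheme, the `critical`/`bulk` priority labels, and the helpers `topic_for` and `partition_for` are assumptions, and the hash shown is not the algorithm Kafka's own partitioner uses. It only demonstrates the idea of a stable key-to-partition mapping.

```python
import hashlib

CHANNELS = {"push", "sms", "email"}
PRIORITIES = {"critical", "bulk"}


def topic_for(channel, priority):
    """Pick a topic so critical traffic never queues behind bulk traffic."""
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel}")
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority}")
    return f"notifications.{channel}.{priority}"


def partition_for(user_id, num_partitions):
    """Map a user to a fixed partition so that user's messages stay ordered."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


print(topic_for("push", "critical"))  # prints "notifications.push.critical"
```

Separate topics also let each class of traffic have its own retention, retry budget, and consumer group size, which addresses the "identical priority and storage needs" mistake directly.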
