Design a Chat System (WhatsApp/WeChat)
Design a large-scale 1-on-1 and group chat application. Focus on real-time messaging, message delivery guarantees, state management, and the trade-offs between WebSockets and other connection models such as HTTP long polling.
Why Interviewers Ask This
Meta asks this to evaluate your ability to architect real-time systems handling millions of concurrent connections. They specifically test your understanding of WebSocket persistence, message delivery guarantees like at-least-once semantics, and how to manage state consistency across distributed servers for both private and group chats without data loss.
How to Answer This Question
1. Clarify requirements: Define scale (DAU), latency targets (sub-second), and features like read receipts or typing indicators.
2. Estimate capacity: Calculate messages per second and storage needs using Meta's typical traffic patterns.
3. Design the API: Outline REST endpoints for initial sync and define the WebSocket protocol for real-time streams.
4. Architecture core: Propose a sharded microservices model where users connect to specific chat nodes based on user ID hashing.
5. Ensure reliability: Discuss message queues (Kafka) for durability, acknowledgment protocols, and conflict resolution for group edits.
6. Address scaling: Explain horizontal scaling strategies for connection managers and database sharding for message history.
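The capacity estimate in step 2 can be done as quick arithmetic. The sketch below shows the shape of that calculation; every input (user count, messages per user, message size, peak factor) is an assumed placeholder chosen for illustration, not a real traffic figure.

```python
# Back-of-envelope capacity estimate.
# All inputs are assumed placeholders, not measured traffic numbers.
SECONDS_PER_DAY = 86_400


def estimate(dau, msgs_per_user, msg_bytes, peak_factor):
    """Return daily volume, average and peak send rate, and daily storage."""
    msgs_per_day = dau * msgs_per_user
    avg_per_sec = msgs_per_day / SECONDS_PER_DAY
    return {
        "msgs_per_day": msgs_per_day,
        "avg_msgs_per_sec": avg_per_sec,
        "peak_msgs_per_sec": avg_per_sec * peak_factor,
        "storage_bytes_per_day": msgs_per_day * msg_bytes,
    }


result = estimate(
    dau=500_000_000,     # assumed daily active users
    msgs_per_user=40,    # assumed average sends per user per day
    msg_bytes=200,       # assumed payload plus metadata
    peak_factor=3,       # assumed peak-to-average ratio
)
for name, value in result.items():
    print(f"{name}: {value:,.0f}")
```

With these assumed inputs the average rate comes out to roughly 231,000 messages per second and about 4 TB of new message data per day. In an interview, stating the inputs out loud matters more than the exact numbers.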
Key Points to Cover
- Explicitly choosing WebSockets over HTTP polling for low-latency bidirectional communication
- Persisting each message to a durable log such as Kafka before delivery, so an accepted message survives a server failure
- Using consistent hashing to shard users across servers for efficient load distribution
- Defining an acknowledgment protocol to ensure at-least-once delivery semantics
- Addressing the complexity of fan-out optimization for group messaging scenarios
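The consistent hashing point is easier to defend with a concrete picture. Below is a minimal sketch of a hash ring with virtual nodes; the class name `HashRing`, the server names, and the choice of 100 virtual nodes are all assumptions made for illustration, not a description of any production system.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # md5 is used only for its even spread, not for security.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring mapping user IDs to chat servers."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            if point not in self._owner:
                self._owner[point] = node
                bisect.insort(self._points, point)

    def remove(self, node):
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            if self._owner.get(point) == node:
                del self._owner[point]
                self._points.remove(point)

    def node_for(self, user_id: str) -> str:
        # Walk clockwise to the first point past the user's hash,
        # wrapping around at the end of the ring.
        i = bisect.bisect_right(self._points, _hash(user_id))
        return self._owner[self._points[i % len(self._points)]]


ring = HashRing(["chat-1", "chat-2", "chat-3"])
print(ring.node_for("user-42"))
```

The property worth calling out is that removing a server only remaps the users that server owned; everyone else keeps their existing connection target, which plain modulo hashing does not give you.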
Sample Answer
To design a scalable chat system like WhatsApp, I start by defining non-functional requirements: sub-second end-to-end delivery latency and 99.99% availability. First, we need an API gateway that authenticates clients (for example, with OAuth tokens) and routes requests to the appropriate services. For real-time communication, WebSockets are preferable to HTTP polling because the connection is persistent and full-duplex, so the server can push messages the moment they arrive. We would shard users across multiple Chat Services; each service manages a subset of active connections.

When User A sends a message to User B, the request hits A's Chat Service, which writes the payload to a Kafka topic for durability and acknowledges receipt to A once the write is confirmed. The Chat Service holding B's connection consumes from Kafka and pushes the message over B's open WebSocket. For delivery guarantees, B's client sends a delivery acknowledgment back; the server retries unacknowledged messages, which gives at-least-once delivery, and the client discards duplicates by message ID. Read receipts are a separate 'read' event that triggers a status update for the sender.

To handle offline users, we store messages in a durable store like Cassandra or DynamoDB and retrieve them upon reconnection. Finally, for group chats, we must handle fan-out efficiently by publishing each message once to a dedicated group channel rather than writing it separately to every member's stream, and by ordering messages per group so all participants see a consistent history.
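The acknowledgment flow in the sample answer can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not a real network protocol: `Outbox`, `Client`, and the `flaky` transport are hypothetical names, and the transport simply simulates one lost acknowledgment.

```python
class Outbox:
    """Hypothetical server-side record of messages awaiting a delivery ack."""

    def __init__(self):
        self.pending = {}  # msg_id -> body

    def send(self, msg_id, body, transport):
        self.pending[msg_id] = body  # remember it until the client acks
        transport(msg_id, body)

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)

    def retry(self, transport):
        # Resend everything that has not been acknowledged yet.
        for msg_id, body in list(self.pending.items()):
            transport(msg_id, body)


class Client:
    """Hypothetical receiver that deduplicates by message ID."""

    def __init__(self):
        self.seen = set()
        self.inbox = []

    def receive(self, msg_id, body):
        # Duplicates are expected under at-least-once delivery.
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.inbox.append(body)
        return msg_id  # always ack, even for a duplicate


outbox, client = Outbox(), Client()
attempts = {"n": 0}


def flaky(msg_id, body):
    attempts["n"] += 1
    acked = client.receive(msg_id, body)
    if attempts["n"] > 1:  # simulate the first ack being lost in transit
        outbox.ack(acked)


outbox.send("m1", "hello", flaky)
outbox.retry(flaky)
print(client.inbox, outbox.pending)
```

The message is delivered twice but appears once in the inbox. That is the trade at the heart of at-least-once semantics: the server is allowed to resend, so the receiver has to be idempotent.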
Common Mistakes to Avoid
- Focusing only on database schema while ignoring the critical role of connection management and WebSocket heartbeats
- Proposing simple polling mechanisms which fail under the high concurrency demands of a platform like Meta
- Neglecting to discuss message ordering and potential race conditions in group chat updates
- Overlooking the strategy for handling offline users and syncing message history upon reconnection
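One common way to address both the ordering and the reconnection points above is a per-conversation sequence number. The sketch below is a single-process illustration; `ConversationLog` and its methods are hypothetical names, and a distributed deployment would need one owner per conversation to assign sequence numbers, which this sketch does not attempt.

```python
from collections import defaultdict


class ConversationLog:
    """Hypothetical in-memory message log with per-conversation ordering."""

    def __init__(self):
        self._next_seq = defaultdict(int)
        self._messages = defaultdict(list)  # conversation_id -> [(seq, sender, body)]

    def append(self, conversation_id, sender, body):
        # Each conversation has its own monotonically increasing counter.
        self._next_seq[conversation_id] += 1
        seq = self._next_seq[conversation_id]
        self._messages[conversation_id].append((seq, sender, body))
        return seq

    def since(self, conversation_id, last_seen_seq):
        # A reconnecting client sends the last sequence number it saw
        # and receives only what it missed, in order.
        history = self._messages.get(conversation_id, [])
        return [m for m in history if m[0] > last_seen_seq]


log = ConversationLog()
log.append("group-1", "alice", "hi")
log.append("group-1", "bob", "hello")
log.append("group-1", "alice", "lunch?")
print(log.since("group-1", last_seen_seq=1))
```

Because every member of a group sees the same sequence numbers, clients can detect gaps and render messages in the same order regardless of arrival order.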