Design a Video Conferencing Service (Zoom)

System Design
Hard
Meta
119.9K views

Design a real-time video chat service. Focus on WebRTC, media server architecture (SFU/MCU), and managing bandwidth/latency for large groups.

Why Interviewers Ask This

Meta evaluates candidates on their ability to architect scalable, low-latency real-time systems. This question specifically tests your understanding of WebRTC constraints, the trade-offs between SFU and MCU architectures for group calls, and your capacity to optimize bandwidth and jitter management under high-concurrency scenarios typical of Meta's massive user base.

How to Answer This Question

1. Clarify requirements: define scale (users per room), latency targets (e.g., under 200 ms round-trip), and features such as screen sharing and adaptive bitrate.
2. High-level architecture: propose a client-server model using WebRTC for media transport and an SFU (Selective Forwarding Unit) to route streams without fully decoding them.
3. Deep dive into media routing: explain how the SFU forwards only the tracks each viewer needs rather than mixing them all, keeping server CPU load low.
4. Address network challenges: discuss adaptive bitrate (ABR) algorithms, jitter buffers, and packet loss concealment to handle unstable connections.
5. Scalability and reliability: outline horizontal scaling via sharding and geo-distributed edge nodes to minimize latency globally, referencing Meta's focus on efficiency at scale.
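The mesh-versus-SFU decision in step 2 comes down to simple arithmetic, which is worth stating in the interview. A minimal sketch of the connection math (the 25-person room is just an illustrative size):

```python
def mesh_uplinks(n: int) -> int:
    # In a full P2P mesh, each client encodes and uploads a copy
    # of its stream to every other participant: n - 1 uplinks each.
    return n - 1

def mesh_total_connections(n: int) -> int:
    # Total pairwise connections grow quadratically: n(n-1)/2.
    return n * (n - 1) // 2

def sfu_uplinks(n: int) -> int:
    # With an SFU, each client uploads exactly one copy;
    # the server forwards it to the other n - 1 participants.
    return 1

# A 25-person room: a mesh needs 24 uplinks per client and 300
# connections in total; an SFU keeps every client at 1 uplink.
print(mesh_uplinks(25), mesh_total_connections(25), sfu_uplinks(25))
```

Quoting the quadratic connection count is usually enough to justify rejecting the mesh for anything beyond a handful of participants.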

Key Points to Cover

  • Explicitly choosing SFU over MCU to balance CPU load and bandwidth efficiency
  • Demonstrating knowledge of WebRTC signaling and transport protocols
  • Detailing specific strategies for handling packet loss and jitter in real-time
  • Proposing geo-distributed edge deployment to meet low-latency requirements
  • Incorporating adaptive bitrate logic to maintain quality under variable network conditions
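The jitter-handling point above can be made concrete with a minimal reordering playout buffer. This is a deliberately simplified fixed-depth model (real WebRTC jitter buffers adapt their depth to measured delay variance):

```python
import heapq

class JitterBuffer:
    """Reorders out-of-order packets and releases them only once a
    minimum buffer depth is reached (simplified fixed-depth model)."""

    def __init__(self, depth: int = 3):
        self.depth = depth
        self.heap = []  # min-heap keyed by RTP sequence number

    def push(self, seq: int, payload: bytes) -> None:
        heapq.heappush(self.heap, (seq, payload))

    def pop_ready(self):
        # Release the oldest buffered packet only when enough packets
        # are queued to absorb network jitter; otherwise wait.
        if len(self.heap) >= self.depth:
            return heapq.heappop(self.heap)
        return None

buf = JitterBuffer(depth=3)
for seq in (2, 1, 3):          # packets arrive out of order
    buf.push(seq, b"frame")
print(buf.pop_ready()[0])      # releases seq 1 first despite arrival order
```

The trade-off to call out: a deeper buffer absorbs more jitter but adds playout delay, which is why conferencing systems size it dynamically rather than fixing it as done here.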

Sample Answer

To design a Zoom-like service for Meta, I would start by defining strict latency goals, aiming for sub-200ms round-trip time. For the core architecture, I would reject a pure P2P mesh due to its N-squared connection overhead and instead implement an SFU-based media server. Clients connect via WebRTC, sending encoded video to the SFU, which selectively forwards only the necessary streams to each participant based on active speaker status and the resolution each viewer needs. This keeps each participant's upstream to a single encoded stream, as an MCU would, while avoiding the MCU's server-side mixing and transcoding cost. To support large groups globally, I'd deploy edge servers close to users so media travels the shortest possible path. Crucially, I would integrate an adaptive bitrate algorithm that dynamically adjusts encoding quality based on real-time network feedback, preserving smooth playback even under packet loss. For scalability, the SFU layer would carry minimal session state and be horizontally sharded, allowing us to spin up instances quickly during peak demand. Finally, robust fallback mechanisms, such as switching to audio-only or lower frame rates, maintain reliability when bandwidth drops below critical thresholds.
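The adaptive-bitrate step in the answer above can be sketched as a simple loss-based rate controller, loosely modeled on the loss-based side of WebRTC congestion control. The thresholds, scaling factors, and bitrate bounds here are illustrative assumptions, not the exact production constants:

```python
def next_bitrate(current_bps: int, loss_fraction: float,
                 floor_bps: int = 150_000, ceil_bps: int = 2_500_000) -> int:
    """Pick the next target bitrate from the reported packet loss."""
    if loss_fraction > 0.10:
        # Heavy loss: back off proportionally to the loss rate.
        target = current_bps * (1 - 0.5 * loss_fraction)
    elif loss_fraction < 0.02:
        # Clean interval: probe upward gently.
        target = current_bps * 1.05
    else:
        # Moderate loss: hold steady.
        target = current_bps
    return int(min(max(target, floor_bps), ceil_bps))

print(next_bitrate(1_000_000, 0.20))  # heavy loss: backs off to 900000
print(next_bitrate(1_000_000, 0.00))  # clean link: probes up to 1050000
```

The asymmetry is the key interview point: back off quickly when loss appears, recover slowly, and clamp to a floor so the call degrades to a low tier rather than stalling outright.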

Common Mistakes to Avoid

  • Suggesting a pure P2P mesh architecture which fails to scale beyond small groups
  • Overlooking the computational cost of transcoding by recommending MCU for large rooms
  • Failing to discuss how to handle unstable network conditions or packet loss
  • Ignoring the need for global distribution and focusing only on a single data center
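A simple way to avoid the network-handling mistakes above is to describe an explicit degradation ladder the client walks down as available bandwidth drops. The tier boundaries below are assumptions for illustration, not real Zoom or Meta thresholds:

```python
# Degradation ladder: (minimum sustainable bandwidth in kbps, mode)
LADDER = [
    (1500, "720p video"),
    (600,  "360p video"),
    (250,  "180p video"),
    (0,    "audio only"),
]

def select_mode(available_kbps: int) -> str:
    # Walk down the ladder to the best mode the link can sustain.
    for min_kbps, mode in LADDER:
        if available_kbps >= min_kbps:
            return mode
    return "audio only"

print(select_mode(2000))  # "720p video"
print(select_mode(300))   # "180p video"
print(select_mode(100))   # "audio only"
```

Mentioning a concrete ladder like this, paired with simulcast so the SFU can switch tiers without re-encoding, shows the interviewer you have a plan for the unstable-network case rather than a hand-wave.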
