Design a System for Distributed Tracing (Jaeger/Zipkin)

System Design
Medium
Google

Design a system to trace a single request across hundreds of microservices. Focus on span collection, sampling, and visualization for debugging performance bottlenecks.

Why Interviewers Ask This

Interviewers ask this to evaluate your ability to design scalable, high-performance systems that handle massive data ingestion without blocking user requests. At Google, they specifically test if you understand the trade-offs between consistency and availability in distributed environments, and whether you can implement efficient sampling strategies to manage storage costs while retaining critical debugging data.

How to Answer This Question

1. Clarify requirements: define scale (requests per second), retention policies, and latency constraints typical of Google's infrastructure.
2. Propose a high-level architecture with client-side instrumentation, a local agent for aggregation, and a centralized collector service.
3. Detail the span lifecycle: generation at each microservice, propagation via context headers (like W3C Trace Context), and transmission to collectors.
4. Discuss sampling strategies, contrasting head-based sampling (decided up front, cheap, statistically uniform) with tail-based sampling (decided after the trace completes, so it can target errors and slow requests), and explain how each reduces load on downstream storage.
5. Address visualization and querying: suggest a write-optimized wide-column store such as Bigtable for fast lookups, and explain how to index traces for efficient filtering by service or latency threshold.
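Step 3's context propagation can be made concrete with a short sketch. This is a minimal, hedged illustration of the W3C Trace Context `traceparent` header (format `version-traceid-spanid-flags`); the function names are illustrative, not part of any real tracing library:

```python
# Sketch of W3C Trace Context propagation between services.
# traceparent format: version-traceid-spanid-flags, e.g.
# 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
import re
import secrets

TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def extract_context(headers: dict):
    """Parse an incoming traceparent header; return (trace_id, parent_span_id)."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None  # no valid context: this service starts a new trace
    return match["trace_id"], match["span_id"]

def inject_context(headers: dict, trace_id: str, span_id: str) -> None:
    """Write an outgoing traceparent header so the callee joins the same trace."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"

def start_span(incoming_headers: dict):
    """Continue the caller's trace if present, otherwise begin a new one."""
    ctx = extract_context(incoming_headers)
    trace_id = ctx[0] if ctx else secrets.token_hex(16)  # 128-bit trace ID
    span_id = secrets.token_hex(8)                       # 64-bit span ID
    return trace_id, span_id
```

Every hop extracts the incoming context, creates a child span, and injects the same trace ID into its outbound calls; that shared ID is what lets the collector reassemble the full request path.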

Key Points to Cover

  • Explicitly define the difference between head-based and tail-based sampling strategies
  • Explain how asynchronous batching prevents the tracing system from blocking production code
  • Propose a storage solution optimized for write-heavy workloads and fast read queries
  • Demonstrate understanding of context propagation mechanisms like W3C Trace Context
  • Address cost implications of storing full trace data versus sampled data
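The asynchronous-batching point above can be sketched in a few lines. This is a simplified illustration, not a real Jaeger reporter: the hot path only enqueues finished spans, a background thread batches and flushes them, and spans are dropped (never blocking) when the queue is full. The class and parameter names are assumptions for the example:

```python
# Minimal sketch of an asynchronous batching reporter: the application
# thread only enqueues spans; a background worker batches and flushes them,
# so instrumentation never blocks the request path.
import queue
import threading

class BatchingReporter:
    def __init__(self, flush_fn, batch_size=100, max_queue=10_000):
        self.flush_fn = flush_fn          # e.g. sends a batch to the local agent
        self.batch_size = batch_size
        self.queue = queue.Queue(maxsize=max_queue)
        self.dropped = 0                  # spans shed under backpressure
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def report(self, span):
        """Called on the hot path: O(1), never blocks; drops spans if full."""
        try:
            self.queue.put_nowait(span)
        except queue.Full:
            self.dropped += 1             # losing a span beats slowing a request

    def _run(self):
        batch = []
        while not self._stop.is_set() or not self.queue.empty():
            try:
                batch.append(self.queue.get(timeout=0.1))
            except queue.Empty:
                pass
            if len(batch) >= self.batch_size or (batch and self.queue.empty()):
                self.flush_fn(batch)      # one network call per batch
                batch = []

    def close(self):
        self._stop.set()
        self._worker.join()
```

Dropping spans under backpressure is the key design choice: the tracing system must degrade itself before it degrades production traffic.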

Sample Answer

To design a distributed tracing system like Jaeger for Google-scale services, I would start by establishing the core components: instrumentation libraries, local agents, collectors, and a storage backend.

First, each microservice generates spans representing operations and propagates the trace ID via standard headers to maintain context across service boundaries. Spans are reported asynchronously to a local sidecar agent so the main application thread is never blocked; the agent batches them and forwards the batches to a centralized collector cluster.

The critical challenge is volume: storing every request is unsustainable, so I would implement adaptive sampling. Head-based sampling keeps a random 1% of traffic for a statistical overview. For debugging, however, we need tail-based sampling: if any span in a trace exceeds a latency threshold or returns an error, the entire trace is kept. This ensures we capture failures without overwhelming our storage.

For storage, I'd recommend a write-optimized wide-column store like Bigtable or Cassandra, keyed by trace ID for fast retrieval, with indexes that let engineers filter by service name, operation duration, or error code. Finally, the visualization layer aggregates spans into a Gantt-chart-style timeline, highlighting the service calls where latency spikes occur and enabling rapid root-cause analysis during outages.
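The tail-based sampling rule in the answer above can be expressed as a small decision function at the collector, applied once all spans of a trace have arrived (or a timeout fires). This is a hedged sketch; the threshold and rate are illustrative, not prescriptive:

```python
# Sketch of a tail-based sampling decision: keep the whole trace if any
# span errored or breached the latency threshold; otherwise fall back to
# a small random baseline sample. Thresholds are illustrative.
import random

LATENCY_THRESHOLD_MS = 500
BASELINE_SAMPLE_RATE = 0.01   # 1% statistical baseline for healthy traffic

def keep_trace(spans, rng=random.random):
    """Decide at the collector whether to persist a completed trace."""
    if any(s.get("error") for s in spans):
        return True                                    # always keep failures
    if any(s["duration_ms"] > LATENCY_THRESHOLD_MS for s in spans):
        return True                                    # always keep slow traces
    return rng() < BASELINE_SAMPLE_RATE               # sample healthy traffic
```

Because the decision is all-or-nothing per trace, engineers either see a complete Gantt chart or nothing, never a partial trace with missing spans.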

Common Mistakes to Avoid

  • Focusing only on the UI visualization while ignoring the heavy data ingestion pipeline
  • Suggesting synchronous tracing, which would introduce unacceptable latency into user requests
  • Overlooking the need for sampling, leading to a proposal that cannot scale to billions of daily requests
  • Ignoring context propagation, making it impossible to link spans across different microservices
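The last two mistakes interact: if each service samples independently, traces end up with missing spans. A common remedy, sketched here under illustrative parameters, is to make the head-based decision deterministically from the trace ID so every service reaches the same verdict:

```python
# Sketch of consistent head-based sampling: every service hashes the trace
# ID the same way, so either all spans of a trace are sampled or none are,
# avoiding broken partial traces. Rate and bucketing are illustrative.
SAMPLE_RATE = 0.01  # keep ~1% of traces

def head_sampled(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic decision derived from the 128-bit hex trace ID."""
    bucket = int(trace_id, 16) % 10_000   # map the ID into 10,000 buckets
    return bucket < rate * 10_000         # keep the lowest rate-fraction
```

Since trace IDs are generated uniformly at random, roughly `rate` of all traces land in the kept buckets, and no coordination between services is needed.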

