Design a System for Monitoring API Latency
Design a system to measure, aggregate, and alert on the latency and error rates of thousands of API endpoints. Focus on sampling vs. full-data collection.
Why Interviewers Ask This
Interviewers at Salesforce ask this to evaluate your ability to balance system reliability with cost efficiency. They specifically want to see if you understand that collecting every data point for thousands of endpoints is unsustainable. This question tests your judgment in choosing between sampling strategies and full-data collection while ensuring critical alerts trigger without overwhelming infrastructure.
How to Answer This Question
1. Clarify Requirements: Define the scale up front, for example millions of requests per second across thousands of endpoints, and identify the key metrics, such as p99 latency and error rate.
2. Propose Architecture: Sketch a high-level flow from API gateways through collectors to storage.
3. Address Sampling vs. Full Data: This is the core challenge. Explain why full collection is too expensive and propose adaptive sampling based on traffic volume or error thresholds.
4. Detail Aggregation: Describe how to compute percentiles (p50, p95, p99) over sliding windows using histograms or quantile sketches such as t-digest or DDSketch to save memory.
5. Define Alerting Logic: Outline rules that trigger notifications only when anomalies exceed specific baselines, to prevent alert fatigue.
6. Discuss Trade-offs: Conclude by weighing the accuracy lost to sampling against the cost savings, showing you understand business constraints.
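The adaptive sampling in step 3 could be sketched roughly as follows. This is a minimal illustration, not a production design: the names `AdaptiveSampler`, `Sample`, `base_rate`, and `slow_threshold_ms` are hypothetical, and the 10% rate and 200 ms threshold are assumed values. Each kept sample carries a weight so that downstream counts can be scaled back up to approximate the real traffic.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    endpoint: str
    latency_ms: float
    is_error: bool
    weight: float  # number of real requests this sample stands for


class AdaptiveSampler:
    """Keep every error or slow request; keep the rest at a fixed rate."""

    def __init__(self, base_rate=0.10, slow_threshold_ms=200.0, rng=None):
        self.base_rate = base_rate
        self.slow_threshold_ms = slow_threshold_ms
        self.rng = rng if rng is not None else random.Random()

    def offer(self, endpoint, latency_ms, is_error):
        """Return a Sample if the request is kept, otherwise None."""
        # Errors and slow requests are always kept, each standing for itself.
        if is_error or latency_ms >= self.slow_threshold_ms:
            return Sample(endpoint, latency_ms, is_error, weight=1.0)
        # Normal requests are kept with probability base_rate and weighted
        # by 1 / base_rate so that totals remain unbiased estimates.
        if self.rng.random() < self.base_rate:
            return Sample(endpoint, latency_ms, is_error,
                          weight=1.0 / self.base_rate)
        return None
```

Recording the weight matters because keeping all slow requests while dropping most fast ones would otherwise skew every percentile toward the tail.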
Key Points to Cover
- Explicitly rejecting full-data collection in favor of adaptive sampling to manage costs
- Using sketch algorithms or sliding windows for efficient percentile calculation
- Differentiating between normal traffic sampling and 100% capture for error scenarios
- Implementing anomaly detection rather than static thresholds to reduce false positives
- Aligning the solution with enterprise-scale needs typical of large platforms like Salesforce
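One way to illustrate sketch-based percentile calculation is a log-bucketed histogram in the style of DDSketch. The class below is a simplified sketch under assumptions, not the reference implementation: `LogHistogram` and its parameters are hypothetical names, and it omits the bucket-count limits a real library would enforce. Memory grows with the number of distinct buckets rather than the number of requests, and histograms from different collectors can be merged.

```python
import math
from collections import Counter


class LogHistogram:
    """Log-spaced latency buckets with bounded relative error."""

    def __init__(self, relative_accuracy=0.01):
        # Bucket i covers (gamma^(i-1), gamma^i].
        self.gamma = (1 + relative_accuracy) / (1 - relative_accuracy)
        self.log_gamma = math.log(self.gamma)
        self.buckets = Counter()
        self.count = 0.0

    def add(self, latency_ms, weight=1.0):
        """Record one latency; weight lets a sample stand for many requests."""
        value = max(latency_ms, 1e-3)  # clamp to avoid log(0)
        index = math.ceil(math.log(value) / self.log_gamma)
        self.buckets[index] += weight
        self.count += weight

    def merge(self, other):
        """Combine another collector's histogram into this one."""
        self.buckets.update(other.buckets)  # Counter.update adds counts
        self.count += other.count

    def quantile(self, q):
        """Estimate the q-quantile (0 <= q <= 1) from bucket counts."""
        if self.count <= 0:
            raise ValueError("empty histogram")
        rank = q * (self.count - 1)
        seen = 0.0
        index = None
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen > rank:
                break
        # Representative value whose relative error is at most
        # relative_accuracy for anything in the bucket.
        return 2 * self.gamma ** index / (self.gamma + 1)
```

Merging is the property that matters at scale: a p99 computed per collector cannot be averaged into a fleet-wide p99, but bucket counts can simply be added.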
Sample Answer
To design a monitoring system for thousands of API endpoints, I would start by defining our SLOs, specifically targeting p99 latency under 200ms. Collecting 100% of request data is prohibitively expensive and creates unnecessary load, so I recommend an adaptive sampling strategy. We would implement a probabilistic sampler at the edge, perhaps dropping 90% of normal traffic but keeping 100% of requests that show errors or latencies above a threshold, and recording the sampling rate with each kept request so that aggregates can be re-weighted.
For aggregation, instead of storing raw per-request records, we'd use a sliding time window approach where collectors aggregate metrics into buckets before sending them to a central store like Prometheus or Datadog. To handle the volume, we can use quantile sketches such as t-digest or DDSketch to estimate percentiles efficiently without storing individual latencies. These sketches are mergeable, which matters because percentiles computed on separate collectors cannot simply be averaged.
The system must also include a dynamic alerting layer. If the error rate spikes above 1% or p99 latency breaches our SLO for more than two consecutive minutes, the system should trigger an immediate PagerDuty alert. Finally, we need a dashboard for real-time visualization. At Salesforce, where multi-tenant isolation is critical, we must ensure that tenant-specific latency spikes are visible without being masked by aggregate data. This approach balances granular visibility with the scalability required for enterprise workloads.
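The alerting rule in the answer above could be expressed along these lines. This is a hedged sketch: `WindowStats` and `SustainedBreachAlert` are hypothetical names, the windows are assumed to be one minute long, and "more than two consecutive minutes" is read as a breach streak longer than two windows, applied to both the error-rate and the p99 condition. Actual paging (for example through PagerDuty) is left out.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregates for one endpoint over a one-minute window."""
    requests: int
    errors: int
    p99_ms: float


class SustainedBreachAlert:
    """Fire only after the breach has outlasted the tolerated streak."""

    def __init__(self, p99_slo_ms=200.0, max_error_rate=0.01,
                 tolerated_minutes=2):
        self.p99_slo_ms = p99_slo_ms
        self.max_error_rate = max_error_rate
        self.tolerated_minutes = tolerated_minutes
        self.streak = 0  # consecutive breached windows so far

    def observe(self, stats):
        """Feed one window; return True when an alert should fire."""
        error_rate = stats.errors / stats.requests if stats.requests else 0.0
        breached = (error_rate > self.max_error_rate
                    or stats.p99_ms > self.p99_slo_ms)
        # Any healthy window resets the streak, which suppresses
        # one-off blips and helps limit alert fatigue.
        self.streak = self.streak + 1 if breached else 0
        return self.streak > self.tolerated_minutes
```

With the defaults, three breached windows in a row are needed before `observe` returns True, and a single healthy window resets the count.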
Common Mistakes to Avoid
- Suggesting full data collection for all requests, which ignores scalability and cost constraints
- Focusing solely on storage without explaining how to calculate percentiles efficiently at scale
- Ignoring the difference between average latency and tail latency (p99), which matters most for user experience
- Proposing a single centralized database for ingestion, creating a bottleneck instead of a distributed pipeline
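A small worked example shows why the average hides tail latency. The numbers are invented for illustration: 1,000 requests, of which 980 take 50 ms and 20 take 2,000 ms. The helper `nearest_rank_percentile` is a hypothetical name using the nearest-rank definition of a percentile.

```python
import math


def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Illustrative data: 98% fast requests, 2% very slow ones.
latencies_ms = [50.0] * 980 + [2000.0] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = nearest_rank_percentile(latencies_ms, 99)

print(f"mean={mean_ms:.0f} ms, p99={p99_ms:.0f} ms")
# prints: mean=89 ms, p99=2000 ms
```

An 89 ms average looks comfortably inside a 200 ms target, yet one request in fifty takes two seconds, which only the p99 reveals.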