Design a System for Monitoring Service Health

System Design
Medium
Salesforce
148.3K views

Design a system to collect metrics and check the health status of thousands of microservices. Discuss pull vs. push models (Prometheus vs. StatsD).

Why Interviewers Ask This

Interviewers at Salesforce ask this to evaluate your ability to design scalable, reliable monitoring systems for complex microservice architectures. They specifically assess your understanding of data collection strategies, trade-offs between pull and push models, and how to handle high-volume metrics without overwhelming the system or losing critical health signals during outages.

How to Answer This Question

1. Clarify requirements: Ask about scale (thousands of services), latency tolerance, and whether real-time alerting is needed versus batch analysis.
2. Define the architecture: Propose a layered approach: agents on each service, a central ingestion layer, time-series storage, and an alerting engine.
3. Compare collection models: Explicitly contrast Prometheus's pull model (reliable, scrape-based, makes dead targets easy to detect) with StatsD's push model (low overhead, but riskier under load).
4. Address scalability: Discuss sharding strategies, cardinality limits, and handling network partitions in a large distributed ecosystem like Salesforce's.
5. Conclude with resilience: Explain how the system stays operational even while monitored services are failing, for example by buffering metrics locally before sending them.

Key Points to Cover

  • Explicitly comparing the reliability of pull models against the efficiency of push models
  • Demonstrating awareness of cardinality explosion risks in high-scale systems
  • Proposing a resilient architecture that buffers data during service failures
  • Addressing horizontal scaling strategies for the ingestion and storage layers
  • Connecting technical choices to business needs like reduced alert fatigue
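The last point, reducing alert fatigue, usually comes down to gating alerts on sustained failures rather than single blips. A minimal sketch of that gating logic (the class name and the window size of 3 are arbitrary illustrations):

```python
from collections import deque


class SustainedAnomalyAlerter:
    """Fires only when the last `window` health checks all failed,
    suppressing noise from transient blips."""

    def __init__(self, window=3):
        self.recent = deque(maxlen=window)

    def observe(self, healthy: bool) -> bool:
        self.recent.append(healthy)
        # Alert only on a full window of consecutive failures.
        return len(self.recent) == self.recent.maxlen and not any(self.recent)


alerter = SustainedAnomalyAlerter(window=3)
signals = [True, False, False, False, True]
fired = [alerter.observe(s) for s in signals]
print(fired)  # [False, False, False, True, False]
```

One transient failure never pages anyone; only the third consecutive failure does, which directly addresses the business need of keeping on-call noise low.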

Sample Answer

To design a system for monitoring thousands of microservices, I would first clarify that we need near-real-time visibility into health status with minimal overhead. The architecture starts with lightweight sidecar agents on each service instance.

For the collection strategy, I'd recommend a hybrid approach that leverages the strengths of both models. A Prometheus-style pull model works well for core health and business metrics where reliability is paramount: the central server scrapes endpoints periodically, which keeps collection load predictable and makes dead targets easy to detect. For high-frequency, ephemeral events or bursty traffic patterns, a StatsD-like push model over UDP is more efficient and avoids backpressure on the service.

To handle the scale, we must aggregate aggressively at the edge and constrain unique label combinations so cardinality doesn't explode. The backend should be a horizontally scalable time-series store such as VictoriaMetrics, or Prometheus scaled out with Thanos. Crucially, the monitoring system itself must be resilient: if the backend or network is unavailable, agents buffer metrics locally and retry once connectivity recovers. Finally, we integrate a notification layer that triggers alerts only after sustained anomalies, reducing noise for operations teams managing large-scale cloud environments.
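The push-with-local-buffering idea can be sketched as a StatsD-style UDP client. The class name and buffer policy here are illustrative assumptions, not a standard client; only the datagram wire format (`name:value|c`, `name:value|ms`) follows the StatsD convention.

```python
import socket
from collections import deque


class BufferedStatsdClient:
    """StatsD-style push client that buffers datagrams locally when a send
    fails, then flushes the backlog on the next successful send."""

    def __init__(self, host="127.0.0.1", port=8125, max_buffer=10_000):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # Bounded buffer: drop the oldest samples under a sustained outage.
        self.buffer = deque(maxlen=max_buffer)

    def incr(self, metric, count=1):
        self._send(f"{metric}:{count}|c")   # StatsD counter format

    def timing(self, metric, ms):
        self._send(f"{metric}:{ms}|ms")     # StatsD timer format

    def _send(self, datagram):
        self.buffer.append(datagram)
        try:
            while self.buffer:  # flush backlog oldest-first
                self.sock.sendto(self.buffer[0].encode(), self.addr)
                self.buffer.popleft()
        except OSError:
            pass  # network unavailable: keep the backlog, retry on next call


client = BufferedStatsdClient()
client.incr("checkout.requests")
client.timing("checkout.latency", 42)
```

Note the trade-off the sketch makes explicit: UDP is fire-and-forget (low overhead, no backpressure), so the bounded local buffer is the only defense against losing signals during an outage, and it deliberately drops the oldest data first.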

Common Mistakes to Avoid

  • Focusing solely on one tool without explaining the underlying architectural trade-offs
  • Ignoring the impact of high cardinality labels on storage costs and query performance
  • Assuming the monitoring system itself cannot fail and not designing for self-healing
  • Overlooking the difference between synchronous and asynchronous metric collection in high-load scenarios
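To make the cardinality mistake concrete, here is a back-of-the-envelope calculation showing how label dimensions multiply into distinct time series. The label counts are hypothetical:

```python
# Hypothetical label dimensions for a single metric, http_requests_total.
labels = {
    "service":  [f"svc-{i}" for i in range(1000)],    # thousands of microservices
    "instance": [f"pod-{i}" for i in range(20)],      # pods per service
    "endpoint": [f"/api/v1/r{i}" for i in range(15)],
    "status":   ["200", "400", "404", "500"],
}

# Each unique label combination is a separate time series the TSDB must
# store, index, and query.
series = 1
for values in labels.values():
    series *= len(values)

print(f"worst-case series for one metric: {series:,}")
# worst-case series for one metric: 1,200,000
# Adding one more label, e.g. a user_id with 10,000 values, multiplies
# this by 10,000 -- the classic cardinality explosion.
```

This is why unbounded labels (user IDs, request IDs, raw URLs) are the most common way monitoring backends fall over at scale.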
