Design a System for Monitoring System Metrics (Prometheus)

Question

Accepted Answer

To design a robust metrics monitoring system at the scale of a company like Netflix, I would start by defining the scope: we need to ingest billions of data points daily from thousands of microservices, with new samples queryable within seconds of collection. The core component would be Prometheus, whose pull-based model suits dynamic environments like Kubernetes, where pods constantly spin up and down. In a pull model, Prometheus discovers targets via service discovery and scrapes each target's metrics endpoint on a fixed interval. A failed scrape is itself a signal: Prometheus sets the target's up metric to 0, so a crashed instance shows up right away. In a push model, a dead agent simply stops sending, and the gap can go unnoticed.
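To make the pull model concrete, here is a minimal sketch of the service side: a process that exposes a /metrics endpoint in the Prometheus text exposition format for the scraper to fetch. It uses only the Python standard library. The counter name http_requests_total, the REQUEST_COUNTS dictionary, and port 8000 are assumptions made for illustration; a real service would more likely use an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters keyed by (method, status).
REQUEST_COUNTS = {("GET", "200"): 0, ("GET", "500"): 0}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counter values whenever the scraper pulls /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given (arbitrary) port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. The service never needs to know where Prometheus lives, which is what lets service discovery add and remove targets freely.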

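However, we must address the push vs. pull nuance. For long-running services, pull is the better fit because every scrape doubles as a health check. Short-lived batch jobs and cron tasks may exit before Prometheus ever scrapes them, so for those we would use the Pushgateway: the job pushes its metrics to the gateway before exiting, and the gateway holds the last pushed values for Prometheus to scrape. This hybrid approach covers both continuous streams and transient events, which any platform at Netflix's scale has to handle.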
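As a rough illustration of the batch-job side, the sketch below builds and sends a push to a Pushgateway using its HTTP API (PUT to /metrics/job/<job name>, with a text-format body). The gateway hostname, job name, and metric name are placeholders, and the helper functions build_push_request and push are hypothetical names introduced here, not part of any library.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway_url, job, metrics):
    """Build a PUT request that replaces all metrics stored for this job."""
    # One "name value" sample per line; the body must end with a newline.
    lines = [f"{name} {value}" for name, value in sorted(metrics.items())]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    url = f"{gateway_url.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push(gateway_url, job, metrics, timeout=5.0):
    """Send the metrics and return the HTTP status code from the gateway."""
    request = build_push_request(gateway_url, job, metrics)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A nightly job might end with push("http://pushgateway.example:9091", "nightly_backup", {"backup_duration_seconds": 42.5}). Pushed values stay on the gateway until they are overwritten or deleted, so this pattern is best reserved for job-level results rather than per-instance metrics.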

For storage, I'd keep hot data on Prometheus's local disk for fast querying and use remote write to ship samples to a long-term store such as Thanos, Cortex, or Mimir, which persists them in object storage like S3 for retention and cost efficiency. We would also integrate Alertmanager to deduplicate, group, and route alerts based on severity, ensuring engineers are only paged for critical issues. Finally, visualizing this data through Grafana dashboards lets teams correlate latency spikes with deployment events, directly supporting the team's Site Reliability Engineering goals.
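The deduplicate-then-route idea can be sketched as a toy model. This is not Alertmanager's actual implementation, which groups on configurable labels and applies timing rules; it only shows the shape of the logic. The Alert fields, the receiver names, and the route_alerts function are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    service: str
    severity: str


# Hypothetical receivers: only critical alerts reach the pager.
ROUTES = {"critical": "pager", "warning": "chat"}
DEFAULT_RECEIVER = "email"


def route_alerts(alerts):
    """Drop repeats of the same (name, service) alert, then bucket by receiver."""
    seen = set()
    routed = {}
    for alert in alerts:
        key = (alert.name, alert.service)
        if key in seen:
            continue  # duplicate firing of an alert we already routed
        seen.add(key)
        receiver = ROUTES.get(alert.severity, DEFAULT_RECEIVER)
        routed.setdefault(receiver, []).append(alert)
    return routed
```

With this shape, three replicas firing the same HighLatency alert for one service produce a single page instead of three, and a warning-level alert never reaches the pager at all.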

Related Interview Questions

Design a CDN Edge Caching Strategy

Design a System for Monitoring Service Health

Design a Payment Processing System