Design a Cloud Cost Monitoring and Alerting System

Question

Accepted Answer

To design a Cloud Cost Monitoring and Alerting System suitable for Spotify's scale, I would start by clarifying the non-functional requirements: sub-hourly data freshness for critical alerts and support for both AWS and Azure billing APIs. Because provider billing exports typically lag by hours, sub-hourly alerting has to lean on usage metrics as a proxy for spend, reconciled against billing data once it arrives. The ingestion layer would be event-driven: scheduled Lambda functions pull detailed usage records from the billing APIs, while near-real-time usage signals arrive through CloudWatch/EventBridge, and both paths publish to a message bus like Kafka. This decouples ingestion from processing, so a slow or failing consumer never blocks collection.
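As a rough illustration of that decoupling, here is a minimal Python sketch. The `InMemoryBus` class is a hypothetical stand-in for a Kafka topic, and `RawUsageRecord`, `ingest`, and the `fetch_records` callable are names invented for this example, not part of any provider SDK. The point is only that the ingestion side publishes raw records without knowing how they will be processed.

```python
import json
import queue
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RawUsageRecord:
    provider: str   # e.g. "aws" or "azure"
    payload: dict   # provider-specific record, left untouched at this stage


class InMemoryBus:
    """Hypothetical stand-in for a Kafka topic, backed by a local queue."""

    def __init__(self):
        self._q = queue.Queue()

    def publish(self, message: str) -> None:
        self._q.put(message)

    def poll(self):
        """Return the next message, or None if the bus is empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None


def ingest(provider: str, fetch_records, bus: InMemoryBus) -> int:
    """Pull one batch from a provider and publish each record unmodified."""
    count = 0
    for payload in fetch_records():
        record = RawUsageRecord(provider, payload)
        bus.publish(json.dumps(asdict(record)))
        count += 1
    return count
```

In a scheduled function, `fetch_records` would wrap the real billing API call; consumers read from the bus at their own pace.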

Next, the processing layer normalizes data from the different providers into a unified schema and resolves tag inconsistencies, which are common in large organizations. We would store the normalized data in a time-series database like TimescaleDB, or in a key-value store like DynamoDB with keys designed around team ID and time bucket, so that range queries on specific team IDs or project tags stay fast. For prediction and anomaly detection, I'd implement a lightweight ML model that analyzes historical spend patterns to forecast future costs and flag sudden deviations, such as a 50% spike within two hours.
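The normalization and spike check can be sketched in Python as follows. Everything here is an assumption made for illustration: the raw field names (`usage_start`, `unblended_cost`, `date`, `cost_in_usd`), the `TAG_ALIASES` table, and the `CostRecord` schema are hypothetical rather than the providers' actual export columns. The `is_spike` function is a plain threshold rule standing in for the ML model, not a forecasting model itself.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical mapping of inconsistent tag keys to one canonical key.
TAG_ALIASES = {"Team": "team", "team_id": "team", "owner-team": "team"}


@dataclass(frozen=True)
class CostRecord:
    provider: str
    team: str
    hour: datetime
    cost_usd: float


def normalize_tags(tags: dict) -> dict:
    """Map tag keys to canonical names and lowercase the values."""
    out = {}
    for key, value in tags.items():
        canonical = TAG_ALIASES.get(key, key.lower())
        out[canonical] = str(value).strip().lower()
    return out


def normalize(provider: str, raw: dict) -> CostRecord:
    """Convert a provider-specific record (assumed shape) to CostRecord."""
    tags = normalize_tags(raw.get("tags", {}))
    if provider == "aws":
        ts, cost = raw["usage_start"], raw["unblended_cost"]
    elif provider == "azure":
        ts, cost = raw["date"], raw["cost_in_usd"]
    else:
        raise ValueError(f"unknown provider: {provider}")
    hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
    return CostRecord(provider, tags.get("team", "untagged"), hour, float(cost))


def is_spike(hourly_costs, window=2, threshold=0.5):
    """True if the last `window` hours cost at least `threshold` more
    (as a fraction) than the `window` hours immediately before them."""
    if len(hourly_costs) < 2 * window:
        return False
    recent = sum(hourly_costs[-window:])
    baseline = sum(hourly_costs[-2 * window:-window])
    if baseline <= 0:
        return recent > 0
    return (recent - baseline) / baseline >= threshold
```

With the defaults, `is_spike` encodes the "50% spike within two hours" rule; a trained model would replace the fixed baseline with a forecast.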

The alerting mechanism would be tiered: immediate PagerDuty notifications for critical budget breaches and daily Slack digests for minor variances. Crucially, we must include a feedback loop in which engineers acknowledge alerts and those acknowledgements tune the alert sensitivity, preventing alert fatigue. Finally, we'd expose a dashboard where teams can visualize their own spend against budgets, fostering a culture of financial ownership while giving leadership a global view of cloud expenditure.
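A minimal Python sketch of the tiering and feedback loop might look like this. The `AlertPolicy` class, its default ratios (page at 100% of budget, digest at 80%), and the 0.05 adjustment step are all assumptions chosen for illustration; the returned strings stand in for real PagerDuty and Slack integrations, which are not called here.

```python
from dataclasses import dataclass


@dataclass
class AlertPolicy:
    critical_ratio: float = 1.0  # spend/budget at or above this pages on-call
    warning_ratio: float = 0.8   # at or above this goes into the daily digest

    def route(self, spend: float, budget: float) -> str:
        """Return "page", "digest", or "none" for a team's current spend."""
        if budget <= 0:
            raise ValueError("budget must be positive")
        ratio = spend / budget
        if ratio >= self.critical_ratio:
            return "page"
        if ratio >= self.warning_ratio:
            return "digest"
        return "none"

    def record_feedback(self, useful: bool, step: float = 0.05) -> None:
        """Adjust digest sensitivity from an engineer's acknowledgement:
        raise the threshold after a noisy alert, lower it after a useful one."""
        if useful:
            self.warning_ratio = max(0.5, self.warning_ratio - step)
        else:
            self.warning_ratio = min(self.critical_ratio, self.warning_ratio + step)
```

Only the digest threshold moves in response to feedback here; keeping the paging threshold fixed is a design choice so that noisy-alert feedback cannot silence a genuine budget breach.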

Related Interview Questions

Design a CDN Edge Caching Strategy

Design a System for Monitoring Service Health

Design a Payment Processing System