Design a Cloud Cost Monitoring and Alerting System

System Design
Medium
Spotify

Design a service to track cloud spending (AWS/Azure) across different teams/projects, predict future spend, and alert on sudden spikes.

Why Interviewers Ask This

Interviewers at Spotify ask this to evaluate your ability to balance cost efficiency with engineering velocity. They want to see if you can design a system that provides real-time visibility across distributed teams without becoming a bottleneck. This tests your understanding of cloud billing APIs, data aggregation strategies, and your capacity to build scalable alerting mechanisms that prevent budget overruns while maintaining developer autonomy.

How to Answer This Question

1. Clarify Requirements: Define the scope up front: supporting AWS and Azure simultaneously, handling multi-team tags, and the acceptable alert latency (e.g., near-real-time vs. daily). Ask about specific budget thresholds and any SLOs for the monitoring service itself.
2. Define High-Level Architecture: Propose a pipeline that starts with billing and usage data ingestion from the cloud providers, moves through a processing layer for normalization, and ends in a storage and visualization layer. Mention how you would handle data volume spikes during month-end billing cycles.
3. Detail Core Components: Discuss specific technologies such as Kinesis or Kafka for streaming, a time-series database like Prometheus or InfluxDB for metrics, and a machine learning model for detecting anomalous spend spikes.
4. Address Edge Cases: Explain how you will handle missing tags, currency conversion, and false positives in alerting, so that engineering managers do not develop 'alert fatigue'.
5. Summarize Trade-offs: Conclude by weighing the cost of running the monitoring tool against the savings it can deliver, and show that the solution fits Spotify's culture of experimentation and ownership.
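
The normalization step and two of the edge cases above (missing tags, currency conversion) can be sketched in a few lines. This is a minimal illustration, not a provider format: the `CostRecord` schema, the shape of the raw input row, the `unallocated` bucket, and the static exchange rates are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical unified schema; the field names are illustrative only.
@dataclass(frozen=True)
class CostRecord:
    provider: str
    team: str
    project: str
    day: date
    amount_usd: float

# Assumed static rates for the sketch; a real pipeline would look up daily rates.
FX_TO_USD = {"USD": 1.0, "EUR": 1.10}

# Bucket for spend that carries no team or project tag.
UNTAGGED = "unallocated"

def normalize(provider: str, raw: dict) -> CostRecord:
    """Map one raw billing row (hypothetical shape) onto the unified schema."""
    # Tag keys are often cased inconsistently ("Team" vs "team"), so lowercase them.
    tags = {key.lower(): value for key, value in raw.get("tags", {}).items()}
    rate = FX_TO_USD[raw["currency"]]
    return CostRecord(
        provider=provider,
        team=tags.get("team", UNTAGGED),
        project=tags.get("project", UNTAGGED),
        day=date.fromisoformat(raw["date"]),
        amount_usd=round(raw["amount"] * rate, 2),
    )
```

Routing untagged spend to an explicit bucket, rather than dropping it, keeps totals reconcilable with the provider invoice and makes tagging gaps visible on the dashboard.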

Key Points to Cover

  • Demonstrating knowledge of specific cloud provider APIs (AWS Cost Explorer, Azure Billing API) and their limitations
  • Proposing a decoupled, event-driven architecture to handle variable data ingestion loads effectively
  • Addressing the challenge of data normalization across multiple clouds and inconsistent tagging strategies
  • Incorporating predictive analytics or anomaly detection rather than just static threshold-based alerting
  • Designing for user experience by including self-service dashboards to empower individual engineering teams
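
To make the contrast between static thresholds and anomaly detection concrete, here is a minimal sketch that uses a z-score over recent daily spend as a simple stand-in for a learned model. The function name, the default threshold of 3 standard deviations, and the 7-point minimum history are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def is_anomalous(
    history: list[float],
    today: float,
    z_threshold: float = 3.0,
    min_points: int = 7,
) -> bool:
    """Flag today's spend if it sits far above the recent daily baseline."""
    # With too little history the baseline is unreliable, so stay quiet.
    if len(history) < min_points:
        return False
    baseline = mean(history)
    spread = stdev(history)
    # Perfectly flat history: any increase at all is a deviation.
    if spread == 0:
        return today > baseline
    # Only upward deviations matter for budget protection.
    return (today - baseline) / spread > z_threshold
```

Unlike a fixed dollar threshold, this adapts to each team's own spending level: a 50-dollar jump is noise for a large team and a real signal for a small one.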

Sample Answer

To design a Cloud Cost Monitoring and Alerting System suitable for Spotify's scale, I would start by clarifying our non-functional requirements: sub-hourly data freshness for critical alerts, to the extent the providers' billing data allows it, and support for both AWS and Azure billing APIs.

My architecture would begin with an event-driven ingestion layer, where we pull detailed usage records via scheduled Lambda functions or stream them from CloudWatch/EventBridge into a message bus like Kafka. This decouples ingestion from processing. Next, the processing layer normalizes data from the different providers into a unified schema and resolves the tag inconsistencies that are common in large organizations. We would store the normalized data in a store optimized for fast range queries on specific team IDs or project tags, such as TimescaleDB (a time-series database) or DynamoDB (a key-value store, keyed by team and timestamp).

For prediction and anomaly detection, I'd implement a lightweight ML model that analyzes historical spend patterns to forecast future costs and flag sudden deviations, such as a 50% spike within two hours. The alerting mechanism would be tiered: immediate PagerDuty notifications for critical budget breaches and daily Slack digests for minor variances. Crucially, we must include a feedback loop in which engineers acknowledge or dismiss alerts and that signal is used to tune sensitivity, preventing alert fatigue.

Finally, we'd expose a dashboard that lets teams visualize their own spend against budgets, fostering a culture of financial ownership while giving leadership a global view of cloud expenditure.
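
The forecasting and tiered alerting described in this answer could look roughly like the sketch below. It substitutes a naive run-rate projection for the ML model, and the `Route` names and the 1.0 and 1.2 budget ratios are hypothetical values picked for the example, not figures from the answer.

```python
from enum import Enum

class Route(Enum):
    PAGE = "pagerduty"       # immediate notification for critical breaches
    DIGEST = "slack_digest"  # batched into the daily summary
    NONE = "none"

def forecast_month_end(spend_to_date: float, day_of_month: int, days_in_month: int) -> float:
    """Naive run-rate projection: assumes the average daily spend so far continues."""
    if not 1 <= day_of_month <= days_in_month:
        raise ValueError("day_of_month must be within the month")
    return spend_to_date / day_of_month * days_in_month

def route_alert(
    projected: float,
    budget: float,
    page_ratio: float = 1.2,
    digest_ratio: float = 1.0,
) -> Route:
    """Pick an alert tier from the projected-spend-to-budget ratio."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = projected / budget
    if ratio >= page_ratio:
        return Route.PAGE
    if ratio > digest_ratio:
        return Route.DIGEST
    return Route.NONE
```

For example, a team that has spent 6,000 by day 10 of a 30-day month projects to 18,000; against a 10,000 budget that is a ratio of 1.8, which lands in the paging tier.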

Common Mistakes to Avoid

  • Focusing solely on the UI or dashboard without explaining the underlying data pipeline and storage strategy
  • Ignoring the complexity of normalizing data from different cloud providers with different billing granularities
  • Over-engineering the solution with complex microservices when a simpler serverless approach might suffice initially
  • Failing to discuss how to handle false positives in alerting, which leads to engineers ignoring critical warnings
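
One simple way to address the last point is to suppress repeat alerts for the same team and alert type within a cooldown window. The sketch below is one possible approach under assumed names (`AlertSuppressor`, `should_send`) and an assumed 6-hour default; it keeps state in memory, which a real service would move to shared storage.

```python
from datetime import datetime, timedelta

class AlertSuppressor:
    """Drops repeat alerts for the same (team, alert type) inside a cooldown window."""

    def __init__(self, cooldown: timedelta = timedelta(hours=6)):
        self.cooldown = cooldown
        # In-memory state for the sketch only.
        self._last_sent: dict[tuple[str, str], datetime] = {}

    def should_send(self, team: str, alert_type: str, now: datetime) -> bool:
        key = (team, alert_type)
        last = self._last_sent.get(key)
        # Still inside the cooldown: suppress and keep the original send time.
        if last is not None and now - last < self.cooldown:
            return False
        self._last_sent[key] = now
        return True
```

A cooldown limits noise from a single ongoing incident; it does not replace tuning the detector itself, so it works best alongside the acknowledgment feedback loop.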
