Design a Simple ETL Pipeline

System Design
Medium
Netflix

Design a basic Extract, Transform, Load (ETL) pipeline to move data from a production database to an analytics warehouse. Discuss scheduling, data cleaning, and idempotency.

Why Interviewers Ask This

Interviewers at Netflix ask this to evaluate your ability to design robust, scalable data systems that prioritize reliability and data integrity. They specifically test your understanding of idempotency to prevent duplicate records in high-volume environments, your approach to handling dirty production data, and your strategy for scheduling incremental loads without impacting source system performance.

How to Answer This Question

1. Clarify Requirements: Start by defining the source (e.g., MySQL), destination (e.g., Redshift or Snowflake), volume, and latency needs, noting Netflix's preference for near real-time streaming over batch if applicable.
2. Outline Architecture: Propose a clear flow: extract from the source using CDC (Change Data Capture) or snapshots, stage raw data, apply transformations, and load into the warehouse.
3. Address Transformation Logic: Detail specific cleaning steps like handling nulls, standardizing timestamps, and deduplicating records before loading.
4. Discuss Scheduling & Reliability: Explain how you would schedule jobs (e.g., Airflow) and ensure idempotency by using upsert operations or partition-based overwrites rather than simple appends.
5. Error Handling: Briefly mention dead-letter queues for failed records and alerting mechanisms to maintain system health.
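The idempotency point in step 4 can be sketched in plain Python. This is a hypothetical in-memory stand-in for a warehouse MERGE/upsert, not a real warehouse API; the table, partition, and field names are illustrative.

```python
from datetime import date

# Hypothetical in-memory "warehouse", partitioned by load date.
# Each partition maps a composite key (user_id, event_id) -> latest record.
warehouse = {}

def upsert_partition(partition_date, records):
    """Idempotent load: re-running the same batch leaves the partition in
    the same final state, because each composite key is overwritten
    rather than appended."""
    part = warehouse.setdefault(partition_date, {})
    for rec in records:
        key = (rec["user_id"], rec["event_id"])
        part[key] = rec  # upsert: insert new key or overwrite existing

batch = [
    {"user_id": 1, "event_id": "a", "amount": 10},
    {"user_id": 1, "event_id": "b", "amount": 20},
]
upsert_partition(date(2024, 1, 1), batch)
upsert_partition(date(2024, 1, 1), batch)  # simulated retry after failure
print(len(warehouse[date(2024, 1, 1)]))  # 2 rows, not 4
```

In a real warehouse this is a `MERGE` (Snowflake/Redshift) or a partition overwrite; the retry behavior is the property interviewers are probing for.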

Key Points to Cover

  • Emphasize idempotency through upserts or partition overwrites to prevent data duplication
  • Propose Change Data Capture (CDC) for efficient extraction without locking the source
  • Define specific data cleaning rules like schema validation and PII masking
  • Explain how scheduling handles failures and retries gracefully
  • Reference scalability and monitoring to match Netflix's high-reliability standards
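The cleaning rules in the third bullet can be made concrete with a small sketch. The field names (`user_id`, `ts_epoch`, `email`) and masking rule are assumptions for illustration, not a prescribed schema.

```python
from datetime import datetime, timezone

def clean_record(raw):
    """Apply cleaning rules before load: reject records that fail schema
    validation, normalize timestamps to UTC ISO-8601, and mask PII.
    Field names are illustrative."""
    if raw.get("user_id") is None:
        return None  # fails schema validation; caller routes it aside
    ts = datetime.fromtimestamp(raw["ts_epoch"], tz=timezone.utc)
    return {
        "user_id": raw["user_id"],
        "ts_utc": ts.isoformat(),  # standardized UTC timestamp
        "email_masked": "***@" + raw["email"].split("@")[1],  # PII masking
    }

cleaned = clean_record({"user_id": 7, "ts_epoch": 0, "email": "a@example.com"})
print(cleaned["ts_utc"])  # 1970-01-01T00:00:00+00:00
```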

Sample Answer

I would design this pipeline using a micro-batch architecture with Apache Airflow for orchestration, given Netflix's scale. For extraction, I'd use Debezium to capture change events (CDC) from the production database, minimizing lock contention on the source. These events would land in a staging buffer such as Kafka.

During transformation, I would run a Spark job to handle data cleaning: validating schema types, normalizing date formats to UTC, and removing PII where necessary. Crucially, to ensure idempotency, I would not simply append data. Instead, I would partition the target table by date and upsert on a unique composite key; if a run fails and is retried, the final state remains consistent because the latest record overwrites any previous one for that key.

For scheduling, I'd configure Airflow DAGs to trigger every 15 minutes, keeping analytics latency low. I would also route malformed records to a dead-letter queue so they don't block the entire pipeline. Finally, I'd add monitoring dashboards to track lag and error rates, aligning with Netflix's culture of operational excellence and automated remediation.
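The dead-letter-queue idea from the sample answer can be sketched as a small routing helper. This is a minimal in-memory sketch; `transform` stands in for any cleaning function that raises on malformed input, and the record shapes are hypothetical.

```python
def process_batch(records, transform):
    """Route records that fail transformation to a dead-letter queue
    instead of failing the whole batch."""
    loaded, dead_letter = [], []
    for rec in records:
        try:
            loaded.append(transform(rec))
        except (KeyError, ValueError) as err:
            # Park the bad record with its error for later inspection,
            # rather than crashing the pipeline run.
            dead_letter.append({"record": rec, "error": str(err)})
    return loaded, dead_letter

ok, dlq = process_batch(
    [{"amount": "10"}, {"amount": "oops"}],  # second record is malformed
    lambda r: {"amount": int(r["amount"])},
)
print(len(ok), len(dlq))  # 1 1
```

In production the dead-letter destination would be a separate Kafka topic or an S3 error prefix, paired with alerting when its volume spikes.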

Common Mistakes to Avoid

  • Suggesting simple 'append' operations which fail idempotency checks during retries
  • Ignoring the impact of heavy ETL loads on the production database performance
  • Failing to define how malformed data is handled, leaving bad records to crash the pipeline
  • Overlooking the need for a staging layer to decouple extraction from transformation
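The first mistake above can be demonstrated in a few lines: a retried append duplicates rows, while a keyed upsert converges to the same state. The record shapes are illustrative.

```python
append_table = []   # naive append-only load
upsert_table = {}   # load keyed on a unique id

batch = [{"id": 1, "v": "x"}, {"id": 2, "v": "y"}]

for _attempt in range(2):  # original run plus a retry after a failure
    append_table.extend(batch)         # append: rows duplicate on retry
    for rec in batch:
        upsert_table[rec["id"]] = rec  # upsert: retry is a no-op

print(len(append_table), len(upsert_table))  # 4 2
```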

