Design a Simple ETL Pipeline

System Design
Medium
Netflix

Design a basic Extract, Transform, Load (ETL) pipeline to move data from a production database to an analytics warehouse. Discuss scheduling, data cleaning, and idempotency.

Why Interviewers Ask This

Interviewers at Netflix ask this to evaluate your ability to design robust, scalable data systems that prioritize reliability and data integrity. They specifically test your understanding of idempotency to prevent duplicate records in high-volume environments, your approach to handling dirty production data, and your strategy for scheduling incremental loads without impacting source system performance.

How to Answer This Question

1. Clarify Requirements: Start by defining the source (e.g., MySQL), destination (e.g., Redshift or Snowflake), volume, and latency needs, noting Netflix's preference for near real-time streaming over batch if applicable.
2. Outline Architecture: Propose a clear flow: extract from the source using CDC (Change Data Capture) or snapshots, stage raw data, apply transformations, and load into the warehouse.
3. Address Transformation Logic: Detail specific cleaning steps like handling nulls, standardizing timestamps, and deduplicating records before loading.
4. Discuss Scheduling & Reliability: Explain how you would schedule jobs (e.g., Airflow) and ensure idempotency by using upsert operations or partition-based overwrites rather than simple appends.
5. Error Handling: Briefly mention dead-letter queues for failed records and alerting mechanisms to maintain system health.
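The idempotency point in step 4 can be sketched in plain Python. This is a hypothetical in-memory stand-in for a warehouse MERGE/upsert, not a real warehouse API; the table, partition, and field names are illustrative.

```python
from datetime import date

# Hypothetical in-memory "warehouse", partitioned by load date.
# Each partition maps a composite key (user_id, event_id) -> latest record.
warehouse = {}

def upsert_partition(partition_date, records):
    """Idempotent load: re-running the same batch leaves the partition in
    the same final state, because each composite key is overwritten
    rather than appended."""
    part = warehouse.setdefault(partition_date, {})
    for rec in records:
        key = (rec["user_id"], rec["event_id"])
        part[key] = rec  # upsert: insert new key or overwrite existing

batch = [
    {"user_id": 1, "event_id": "a", "amount": 10},
    {"user_id": 1, "event_id": "b", "amount": 20},
]
upsert_partition(date(2024, 1, 1), batch)
upsert_partition(date(2024, 1, 1), batch)  # simulated retry after failure
print(len(warehouse[date(2024, 1, 1)]))  # 2 rows, not 4
```

In a real warehouse this is a `MERGE` (Snowflake/Redshift) or a partition overwrite; the retry behavior is the property interviewers are probing for.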

Key Points to Cover

  • Emphasize idempotency through upserts or partition overwrites to prevent data duplication
  • Propose Change Data Capture (CDC) for efficient extraction without locking the source
  • Define specific data cleaning rules like schema validation and PII masking
  • Explain how scheduling handles failures and retries gracefully
  • Reference scalability and monitoring to match Netflix's high-reliability standards
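The cleaning rules in the third bullet can be made concrete with a small sketch. The field names (`user_id`, `ts_epoch`, `email`) and masking rule are assumptions for illustration, not a prescribed schema.

```python
from datetime import datetime, timezone

def clean_record(raw):
    """Apply cleaning rules before load: reject records that fail schema
    validation, normalize timestamps to UTC ISO-8601, and mask PII.
    Field names are illustrative."""
    if raw.get("user_id") is None:
        return None  # fails schema validation; caller routes it aside
    ts = datetime.fromtimestamp(raw["ts_epoch"], tz=timezone.utc)
    return {
        "user_id": raw["user_id"],
        "ts_utc": ts.isoformat(),  # standardized UTC timestamp
        "email_masked": "***@" + raw["email"].split("@")[1],  # PII masking
    }

cleaned = clean_record({"user_id": 7, "ts_epoch": 0, "email": "a@example.com"})
print(cleaned["ts_utc"])  # 1970-01-01T00:00:00+00:00
```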

Sample Answer

I would design this pipeline using a micro-batch architecture with Apache Airflow for orchestration, given Netflix's scale. For extraction, I'd use Debezium to capture change events (CDC) from the production database, minimizing lock contention on the source. These events would land in a staging buffer such as Kafka.

During transformation, I would run a Spark job to handle data cleaning: validating schema types, normalizing date formats to UTC, and removing PII where necessary. Crucially, to ensure idempotency, I would not simply append data. Instead, I would partition the target table by date and upsert on a unique composite key; if a run fails and is retried, the final state remains consistent because the latest record overwrites any previous one for that key.

For scheduling, I'd configure Airflow DAGs to trigger every 15 minutes, keeping analytics latency low. I would also route malformed records to a dead-letter queue so they don't block the entire pipeline. Finally, I'd add monitoring dashboards to track lag and error rates, aligning with Netflix's culture of operational excellence and automated remediation.
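The dead-letter-queue idea from the sample answer can be sketched as a small routing helper. This is a minimal in-memory sketch; `transform` stands in for any cleaning function that raises on malformed input, and the record shapes are hypothetical.

```python
def process_batch(records, transform):
    """Route records that fail transformation to a dead-letter queue
    instead of failing the whole batch."""
    loaded, dead_letter = [], []
    for rec in records:
        try:
            loaded.append(transform(rec))
        except (KeyError, ValueError) as err:
            # Park the bad record with its error for later inspection,
            # rather than crashing the pipeline run.
            dead_letter.append({"record": rec, "error": str(err)})
    return loaded, dead_letter

ok, dlq = process_batch(
    [{"amount": "10"}, {"amount": "oops"}],  # second record is malformed
    lambda r: {"amount": int(r["amount"])},
)
print(len(ok), len(dlq))  # 1 1
```

In production the dead-letter destination would be a separate Kafka topic or an S3 error prefix, paired with alerting when its volume spikes.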

Common Mistakes to Avoid

  • Suggesting simple 'append' operations which fail idempotency checks during retries
  • Ignoring the impact of heavy ETL loads on the production database performance
  • Failing to define how malformed data is handled, leaving bad records to crash the pipeline
  • Overlooking the need for a staging layer to decouple extraction from transformation
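The first mistake above can be demonstrated in a few lines: a retried append duplicates rows, while a keyed upsert converges to the same state. The record shapes are illustrative.

```python
append_table = []   # naive append-only load
upsert_table = {}   # load keyed on a unique id

batch = [{"id": 1, "v": "x"}, {"id": 2, "v": "y"}]

for _attempt in range(2):  # original run plus a retry after a failure
    append_table.extend(batch)         # append: rows duplicate on retry
    for rec in batch:
        upsert_table[rec["id"]] = rec  # upsert: retry is a no-op

print(len(append_table), len(upsert_table))  # 4 2
```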

