Design a System to Handle Schema Migration

System Design
Medium
LinkedIn

Design a robust process for executing non-blocking database schema migrations on a live production system. Discuss blue/green migration strategies and rollback plans.

Why Interviewers Ask This

Interviewers ask this to evaluate your ability to balance data integrity with service availability in high-scale environments like LinkedIn's feed. They assess whether you understand the risks of locking tables during traffic spikes and if you can architect a strategy that allows zero-downtime updates while maintaining backward compatibility between old and new application versions.

How to Answer This Question

1. Clarify constraints: Ask about current traffic volume, database type (e.g., PostgreSQL vs. Cassandra), and the acceptable downtime threshold at LinkedIn's scale.
2. Define the core challenge: Explain that a direct ALTER TABLE can block writes, which is unacceptable for a live social network.
3. Propose the two-phase approach: First, add the column as nullable without migrating existing data. Second, run a background job to backfill data asynchronously.
4. Discuss versioning: Describe how to deploy code that reads both old and new columns simultaneously to ensure a smooth transition.
5. Detail rollback and monitoring: Outline how to revert code changes quickly if errors occur, and emphasize feature flags for toggling migrations safely.
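The "add as nullable, backfill later" phase can be sketched with a small in-process example. SQLite is used here purely for illustration (it does not model production locking), and the `members`/`display_name` names are hypothetical:

```python
import sqlite3

# Phase 1 of an expand/contract migration: add the new column as
# nullable so the DDL does not rewrite or backfill existing rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO members (name) VALUES ('alice'), ('bob')")

# Non-blocking change: nullable column, no default, no data rewrite.
conn.execute("ALTER TABLE members ADD COLUMN display_name TEXT")

# Existing rows are untouched; the new column reads as NULL until the
# asynchronous backfill job (phase 2) populates it.
rows = conn.execute("SELECT name, display_name FROM members").fetchall()
print(rows)  # [('alice', None), ('bob', None)]
```

The key property to call out in the interview is that the DDL touches only metadata: no row is rewritten, so no long-held lock is taken.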

Key Points to Cover

  • Explicitly rejecting single-command ALTER TABLE due to lock risks
  • Implementing a dual-write pattern for backward compatibility
  • Using asynchronous backfill jobs to avoid performance degradation
  • Designing a reversible rollback mechanism via feature flags
  • Leveraging blue/green deployment for safe traffic shifting
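The dual-write and feature-flag points above can be sketched with in-memory dicts standing in for the two schema versions. All names here (`legacy_store`, `read_prefers_new`, etc.) are hypothetical, not a real LinkedIn API:

```python
legacy_store: dict[int, str] = {}  # stand-in for the old column
new_store: dict[int, str] = {}     # stand-in for the new column
read_prefers_new = False  # feature flag; flipping it back is the rollback

def write_profile(user_id: int, value: str) -> None:
    # Dual-write: every write lands in both schemas, so either read
    # path sees a complete, consistent view.
    legacy_store[user_id] = value
    new_store[user_id] = value

def read_profile(user_id: int):
    # Prefer the new column once the backfill is verified; fall back
    # to the legacy column otherwise.
    if read_prefers_new and user_id in new_store:
        return new_store[user_id]
    return legacy_store.get(user_id)

write_profile(1, "staff engineer")
assert read_profile(1) == "staff engineer"  # served from the legacy path
read_prefers_new = True
assert read_profile(1) == "staff engineer"  # served from the new path
```

Because the flag only changes the read path, rollback is a configuration flip rather than a schema change, which is what makes the mechanism reversible.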

Sample Answer

To handle schema migration on a live system like LinkedIn's, I would avoid blocking DDL operations entirely. My strategy follows a four-step phased rollout.

First, I would modify the schema to add the new column as nullable, ensuring the change is non-blocking and effectively instantaneous. The application code must be updated to write to both the old and new columns during this phase to maintain dual-write compatibility. Second, I would deploy a background worker to migrate existing data asynchronously; this decouples the heavy lifting from user requests and prevents latency spikes. Third, once the backfill is complete and verified, I would update the read logic to prefer the new column while still supporting the old one for safety. Finally, after confirming stability, I would remove the legacy column.

For blue/green deployment, I would spin up a green environment with the migrated schema alongside the production (blue) cluster, then gradually shift traffic via the load balancer while running validation checks. If anomalies appear, we instantly revert traffic to the blue environment.

The rollback plan keeps the old schema active until the final cleanup, so if critical issues arise we simply switch off the new code path, ensuring zero data loss or service interruption.
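The asynchronous backfill described above can be sketched as a worker that walks the primary key in small chunks, keeping each transaction short so user-facing writes are not starved. SQLite and the `members`/`display_name` names are illustrative stand-ins, not a production setup:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE members (id INTEGER PRIMARY KEY, name TEXT, display_name TEXT)"
)
conn.executemany(
    "INSERT INTO members (name) VALUES (?)", [(f"user{i}",) for i in range(10)]
)

BATCH = 4  # small batches bound lock/transaction time; tune to the workload

def backfill_batch(last_id: int) -> int:
    """Copy legacy data into the new column for one key range.

    Returns the highest id processed, or last_id when nothing remains,
    so the caller can resume (or restart after a crash) idempotently.
    """
    rows = conn.execute(
        "SELECT id, name FROM members "
        "WHERE id > ? AND display_name IS NULL ORDER BY id LIMIT ?",
        (last_id, BATCH),
    ).fetchall()
    for row_id, name in rows:
        conn.execute(
            "UPDATE members SET display_name = ? WHERE id = ?",
            (name.title(), row_id),
        )
    conn.commit()
    return rows[-1][0] if rows else last_id

cursor = 0
while True:
    nxt = backfill_batch(cursor)
    if nxt == cursor:
        break  # no unmigrated rows left
    cursor = nxt

remaining = conn.execute(
    "SELECT COUNT(*) FROM members WHERE display_name IS NULL"
).fetchone()[0]
print(remaining)  # 0
```

Filtering on `display_name IS NULL` makes the job safely re-runnable, and the cursor-plus-batch loop is the property interviewers look for: progress is incremental, resumable, and never holds a long transaction.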

Common Mistakes to Avoid

  • Suggesting a maintenance window to stop all traffic, ignoring real-world availability needs
  • Failing to mention backward compatibility between old and new application versions
  • Overlooking the risk of partial data migration causing consistency errors
  • Not defining a concrete plan for reverting changes if the migration fails mid-process
