Design a Distributed Cron/Scheduler Service
Design a highly available service to run scheduled tasks across a cluster of machines. Discuss distributed locks and failure detection for tasks.
Why Interviewers Ask This
Interviewers at Google ask this to evaluate your ability to design resilient distributed systems under failure conditions. They specifically assess how you handle race conditions, ensure exactly-once execution semantics, and manage leader election without a single point of failure. This question reveals if you can balance consistency with availability when coordinating tasks across unreliable network nodes.
How to Answer This Question
1. Clarify requirements: Define scale (tasks per second), latency tolerance, and whether 'at-least-once' or 'exactly-once' execution is needed.
2. High-level architecture: Propose a centralized coordinator or a peer-to-peer model using consensus algorithms like Raft or Paxos for state management.
3. Task distribution: Explain how jobs are queued, assigned to workers, and how the system handles worker failures via heartbeat mechanisms.
4. Distributed locking: Detail strategies like Redis-based locks or ZooKeeper ephemeral nodes to prevent duplicate task execution.
5. Failure detection: Describe how the system detects dead workers and re-schedules their tasks, ensuring data integrity throughout the process.
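Steps 3 and 5 above hinge on heartbeat-based failure detection. A minimal in-process sketch follows; the timeout value and the in-memory heartbeat table are illustrative stand-ins for what would be a replicated store in a real deployment.

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds; a hypothetical tuning value


class FailureDetector:
    """Tracks last-seen heartbeats and flags workers as dead after a timeout."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock      # injectable clock, so tests need not sleep
        self.last_seen = {}     # worker_id -> timestamp of last heartbeat

    def heartbeat(self, worker_id):
        """Record that a worker just reported in."""
        self.last_seen[worker_id] = self.clock()

    def dead_workers(self):
        """Return workers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return [w for w, t in self.last_seen.items() if now - t > self.timeout]
```

On each sweep, tasks assigned to any worker returned by `dead_workers()` would be re-queued for reassignment.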
Key Points to Cover
- Explicitly stating the choice between at-least-once versus exactly-once execution semantics
- Proposing a consensus algorithm like Raft to eliminate single points of failure
- Describing specific distributed locking mechanisms such as ephemeral nodes or optimistic locking
- Explaining how heartbeat intervals and timeouts trigger automatic task reassignment
- Mentioning idempotency as a critical defense against duplicate executions during retries
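The last point, idempotency as a defense against duplicate execution, can be sketched as a dedup check keyed on the job and its scheduled fire time. The in-memory set below stands in for what would be a durable store (for example, a database table with a unique constraint) in a real system.

```python
class IdempotentExecutor:
    """Runs a task at most once per (job_id, scheduled_time) execution key."""

    def __init__(self):
        self.completed = set()  # stand-in for a durable dedup store

    def run(self, job_id, scheduled_time, task):
        key = (job_id, scheduled_time)
        if key in self.completed:
            return "skipped"        # duplicate delivery: safe no-op
        result = task()
        self.completed.add(key)     # record only after the task succeeds
        return result
```

If the same (job, fire time) pair is delivered twice during a retry, the second delivery becomes a no-op rather than a double write.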
Sample Answer
To design a highly available distributed cron service, I would start by defining the core requirement: each scheduled task should run effectively once even if nodes fail, which in practice means at-least-once execution paired with idempotent tasks, since true exactly-once delivery is unattainable over an unreliable network. For the architecture, I'd avoid a single central server, which would be both a bottleneck and a single point of failure. Instead, I propose a cluster of scheduler nodes running a consensus protocol like Raft to maintain a replicated state of pending jobs.

When a job's fire time arrives, the leader node acquires a lease-based distributed lock in an external store such as etcd or Redis, ensuring only one worker executes the task. If a worker crashes mid-execution, the heartbeat mechanism detects the failure within a configured timeout; the lease expires and the job is reassigned to another available node.

To handle clock skew across machines, the cluster triggers jobs from the leader's view of time and uses logical timestamps rather than per-node wall clocks for ordering. Finally, idempotency checks in the task logic serve as a safety net: if a task runs twice due to a rare race condition, it does not corrupt data. This approach balances strong consistency for scheduling state with high availability for execution.
Common Mistakes to Avoid
- Ignoring clock synchronization issues which can cause premature or delayed task triggers across nodes
- Failing to define what happens if the coordination service itself goes down during a critical window
- Overlooking the need for idempotent task design, leading to data corruption on retry scenarios
- Designing a monolithic central scheduler that creates a bottleneck and violates high availability principles
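On the second mistake above, schedulers that lose contact with the coordination service should reconnect with exponential backoff and jitter rather than retrying in lockstep, which would otherwise produce a thundering herd the moment the service recovers. A small sketch, with illustrative parameter values:

```python
import random


def backoff_delays(base=0.5, cap=30.0, attempts=5, rng=random.random):
    """Exponential backoff with full jitter for reconnection attempts.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)),
    so retries from many nodes spread out instead of arriving together.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng() * ceiling)   # full jitter
    return delays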