I want to design a job scheduler cluster consisting of several hosts that perform cron-style job scheduling. For example, when a job that needs to run every 5 minutes
is submitted to the cluster, the cluster should decide which host fires the next run, while ensuring:
- Disaster tolerance: as long as at least one host is still up, the job should fire successfully.
- Validity: exactly one host fires each job run.
Because of the disaster-tolerance requirement, a job cannot be bound to a specific host. One approach is to have all hosts poll a DB table (with a lock, of course), which guarantees that only one host gets the next job run. Since this locks the table frequently, is there a better design?
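
To make the polling approach concrete, here is a minimal Java sketch of hosts racing to claim one job run. Everything in it is an assumption for illustration: an in-memory map stands in for the DB table, a synchronized method stands in for the table lock, and the names PollingClaimSketch, tryClaim, ownerOf and simulate are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class PollingClaimSketch {
    // Stand-in for the shared DB table: "jobId@fireTime" -> host that claimed the run.
    private final Map<String, String> claimedRuns = new HashMap<>();

    // synchronized stands in for the table lock: only one host at a time
    // can check and record a claim for a given run.
    public synchronized boolean tryClaim(String jobId, long fireTimeMillis, String hostId) {
        String key = jobId + "@" + fireTimeMillis;
        if (claimedRuns.containsKey(key)) {
            return false; // another host already owns this run
        }
        claimedRuns.put(key, hostId);
        return true;
    }

    public synchronized String ownerOf(String jobId, long fireTimeMillis) {
        return claimedRuns.get(jobId + "@" + fireTimeMillis);
    }

    // Starts hostCount threads (the "hosts") that all try to claim the same run
    // and returns how many of them succeeded.
    public static int simulate(int hostCount) {
        PollingClaimSketch table = new PollingClaimSketch();
        AtomicInteger winners = new AtomicInteger();
        long fireTime = 300_000L; // the run due 5 minutes after time zero
        Thread[] hosts = new Thread[hostCount];
        for (int i = 0; i < hostCount; i++) {
            String hostId = "host-" + i;
            hosts[i] = new Thread(() -> {
                if (table.tryClaim("job-1", fireTime, hostId)) {
                    winners.incrementAndGet(); // this host would fire the job
                }
            });
        }
        for (Thread t : hosts) {
            t.start();
        }
        try {
            for (Thread t : hosts) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for hosts", e);
        }
        return winners.get();
    }

    public static void main(String[] args) {
        System.out.println("hosts that fired the run: " + simulate(5));
    }
}
```

Whichever host gets the lock first wins the run and the others back off, which covers the validity requirement. Any surviving host can claim the next run, which covers disaster tolerance. The cost is that every poll from every host goes through the same lock.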
Use the Quartz framework for that. It has a cron-like syntax and can be clustered, and only one host in the cluster will fire a given job run. If a host fails, another host picks up the pending job; a job that was mid-execution on the failed host is re-run only if it was marked as requesting recovery.
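
As a rough sketch of what a clustered setup could look like, each host would load a quartz.properties along these lines. The scheduler name MyClusteredScheduler and the data source name myDS are placeholders, the check-in interval is an example value, and the data source connection settings are left out. Clustering in Quartz works through a shared JDBC job store, so the hosts still coordinate through a database.

```
# Same instanceName on every host; AUTO gives each host its own instance id.
org.quartz.scheduler.instanceName = MyClusteredScheduler
org.quartz.scheduler.instanceId = AUTO

# Clustering requires a JDBC job store shared by all hosts.
org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass = org.quartz.impl.jdbcjobstore.StdJDBCDelegate
org.quartz.jobStore.dataSource = myDS
org.quartz.jobStore.tablePrefix = QRTZ_
org.quartz.jobStore.isClustered = true

# How often (in ms) each host checks in; missed check-ins mark a host as failed.
org.quartz.jobStore.clusterCheckinInterval = 20000
```

With this in place, the "every 5 minutes" job from the question would use a cron trigger with the Quartz expression `0 0/5 * * * ?`. The hosts' clocks should be kept synchronized for clustering to behave correctly.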