I am trying to design a generic job scheduler to expand my architectural knowledge and my ability to reason about system design questions in interviews. What I have come up with so far is below. Can you point out what I should work on to make my approach to this type of problem more comprehensive?
I have read a lot of resources online, but I need some specific guidance on moving forward.
Design a generic job scheduler for company X (which is one of the big technology firms of today).
Use cases
Create/Read/Update/Delete Jobs
Investigate jobs that have been run in the past (type of job, time spent, details)
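To make the CRUD use cases concrete, here is a minimal in-memory sketch; the field names (command, run_at, state) are illustrative assumptions on my part, not requirements:

```python
import uuid
from datetime import datetime, timezone

class JobStore:
    """In-memory sketch of the CRUD use cases (all field names are illustrative)."""

    def __init__(self):
        self._jobs = {}  # job_id -> job record

    def create(self, command: str, run_at: datetime) -> str:
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = {
            "id": job_id,
            "command": command,
            "run_at": run_at,
            "state": "Pending",
            "created_at": datetime.now(timezone.utc),
        }
        return job_id

    def read(self, job_id: str) -> dict:
        return self._jobs[job_id]

    def update(self, job_id: str, **fields) -> None:
        self._jobs[job_id].update(fields)

    def delete(self, job_id: str) -> None:
        del self._jobs[job_id]
```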
Constraints
How many jobs will be run on the system per second?
= jobs/day due to users + jobs/day due to machines, converted to per-second
= 1M * 0.5 / 86,400 + (1M / 50) * 20 / 86,400
≈ 5.8 + 4.6 ≈ 10.4 jobs/sec, so budget for ~12 jobs/sec
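The same back-of-envelope arithmetic, spelled out in code so the assumed inputs (1M users at 0.5 jobs/day, one machine per 50 users at 20 jobs/day) are explicit:

```python
SECONDS_PER_DAY = 24 * 3600

user_jobs_per_sec = 1_000_000 * 0.5 / SECONDS_PER_DAY           # ~5.8
machine_jobs_per_sec = (1_000_000 / 50) * 20 / SECONDS_PER_DAY  # ~4.6

total = user_jobs_per_sec + machine_jobs_per_sec
print(f"{total:.1f} jobs/sec")  # ~10.4; round up to ~12 for headroom
```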
How much data will the system need to store?
Reasoning: I am only storing the job execution details; the actual work (script execution) is done on other machines. The data collected includes end time, success/failure status, etc. This is all likely just text, maybe with graphing for illustration purposes. I will be storing the data of all jobs executed in the system via the job scheduler (i.e. over the past 10 years).
= (size of page where job details are set up + size of data collected about the job) * # of jobs per day * 365 days * 10 years
= 1 MB * 900,000 * 365 * 10
≈ 3,285,000,000 MB
= 3,285,000 GB
= 3,285 TB ≈ 3.3 PB
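Spelled out, with the corrected arithmetic:

```python
JOBS_PER_DAY = 900_000       # ~10.4 jobs/sec * 86,400 sec/day
MB_PER_JOB = 1               # setup page + collected execution data (assumed)
DAYS = 365 * 10              # ten years of history

total_mb = MB_PER_JOB * JOBS_PER_DAY * DAYS
print(total_mb)                    # 3,285,000,000 MB
print(total_mb / 1_000_000, "TB")  # ~3,285 TB, i.e. ~3.3 PB
```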
Abstract Design
Based on the estimates above, the traffic is modest but the data volume is substantial (petabyte scale), so storage dominates the design. I would break the design into the following:
Application layer: serves requests and renders the UI details.
Data storage layer: acts like a big hash table, storing key-value mappings: the key is the job keyed by the dateTime it was run, and the value holds the details of that job. This enables easy search of historical and/or scheduled jobs.
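A rough sketch of what that hash-table-like mapping could look like, assuming the key is (run datetime, job id) so time-range lookups become range scans over sorted keys (all names here are hypothetical):

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

# Key: (run_datetime, job_id); value: execution details.
# Keeping the keys sorted makes "all jobs run between t1 and t2" a range scan.
index: list[tuple[datetime, str]] = []
details: dict[tuple[datetime, str], dict] = {}

def record_run(run_at: datetime, job_id: str, info: dict) -> None:
    key = (run_at, job_id)
    index.insert(bisect_left(index, key), key)  # keep the index sorted
    details[key] = info

def runs_between(t1: datetime, t2: datetime) -> list[dict]:
    lo = bisect_left(index, (t1, ""))
    hi = bisect_right(index, (t2, "\uffff"))  # high sentinel for the id
    return [details[k] for k in index[lo:hi]]
```

A production store would persist and partition this mapping, but the access pattern is the same.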
The bottlenecks:
Traffic: 12 jobs/sec is not too challenging. If this spikes, we can use a load balancer to distribute the jobs across multiple servers for execution.
Data: at ~3.3 PB, we need a hash-table-like store that can be queried efficiently for fast access to the jobs that have been executed in the application; no single machine can hold this, so the data must be partitioned.
Scaling the abstract design
The nature of this job scheduler is that each job is in one of a few states: Pending, Failed, Success, Terminated. There is little business logic, and each request returns little data.
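Those states could be captured with a small state machine. This sketch assumes Pending is the only non-terminal state, since no Running state is listed above:

```python
from enum import Enum

class JobState(Enum):
    PENDING = "Pending"
    SUCCESS = "Success"
    FAILED = "Failed"
    TERMINATED = "Terminated"

# Legal transitions: a job starts Pending and ends in exactly one terminal state.
TRANSITIONS = {
    JobState.PENDING: {JobState.SUCCESS, JobState.FAILED, JobState.TERMINATED},
    JobState.SUCCESS: set(),
    JobState.FAILED: set(),
    JobState.TERMINATED: set(),
}

def advance(current: JobState, target: JobState) -> JobState:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```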
To handle the traffic, we can have one application server that handles 12 requests/sec, plus a backup in case it fails. In the future, we can use a load balancer to reduce the number of requests going to each server (assuming more than one server is in production). The advantages would be fewer requests per server, higher availability (if one server fails), and better handling of spiky traffic.
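As a toy illustration of the load-balancing idea, here is a round-robin dispatcher; real load balancers also do health checks and failover, which this sketch omits:

```python
import itertools

class RoundRobinBalancer:
    """Toy round-robin dispatch: each request goes to the next server in turn."""

    def __init__(self, servers: list[str]):
        self._cycle = itertools.cycle(servers)

    def pick(self) -> str:
        return next(self._cycle)

lb = RoundRobinBalancer(["app-1", "app-2"])
for _ in range(4):
    print(lb.pick())  # app-1, app-2, app-1, app-2
```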
For data storage, holding ~3.3 PB will require a sizeable fleet of database machines. We can use a NoSQL or a SQL database. Given that the latter has more widespread use and community support (which helps when troubleshooting) and is used by large firms at the moment, I would choose MySQL.
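A sketch of what the jobs table might look like; sqlite3 stands in for MySQL here so the snippet is self-contained, and all column names are assumptions of mine:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs (
    job_id      TEXT PRIMARY KEY,
    command     TEXT NOT NULL,
    run_at      TEXT NOT NULL,   -- scheduled/executed datetime
    state       TEXT NOT NULL,   -- Pending/Success/Failed/Terminated
    duration_ms INTEGER,
    result      TEXT
);
-- Index to make "jobs run between t1 and t2" queries fast.
CREATE INDEX idx_jobs_run_at ON jobs(run_at);
""")

conn.execute("INSERT INTO jobs VALUES (?,?,?,?,?,?)",
             ("j1", "echo hi", "2016-01-01T00:00:00Z", "Success", 12, "ok"))
rows = conn.execute(
    "SELECT job_id, state FROM jobs WHERE run_at BETWEEN ? AND ?",
    ("2016-01-01", "2016-12-31")).fetchall()
print(rows)  # [('j1', 'Success')]
```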
As the data grows, I would adopt the following strategies to handle it:
1) Create a unique index on the hash key used for lookups
2) Scale the MySQL database vertically by adding more memory
3) Partition the data by sharding (see the sketch after this list)
4) Use master-slave replication for read scaling, with master-master replication to ensure redundancy of the data
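As a sketch of strategy 3, jobs could be routed to shards by hashing the job id; the shard names here are hypothetical. Note that plain modulo hashing forces a large reshuffle of data whenever a shard is added, which is why consistent hashing is the usual refinement:

```python
import hashlib

SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(job_id: str) -> str:
    """Route a job to a shard by hashing its id."""
    h = int(hashlib.md5(job_id.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

print(shard_for("job-42"))  # deterministic: the same id always maps to the same shard
```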
Conclusion
This, then, is my design for the components of a job scheduler.
Most large-scale job schedulers consider aspects not covered in your document.
Some of the key issues, in no particular order, are:
I'm sure there are plenty more; try looking through the docs on Slurm or Grid Engine for more ideas.
Further things to consider: