Design a generic job scheduler

stretchr picture stretchr · Sep 29, 2014 · Viewed 20.8k times · Source

I am trying to design a generic job scheduler to expand my architectural knowledge and ability to think about system design questions in interviews. So far, what I have come up with is below. Can you point out where I should work on to be comprehensive in my approach to tackling this type of problem?

I have read a lot of resources online but need some specific guidance in moving forward.

Design a generic job scheduler for company X (which is one of the big technology firms of today).

Use cases

Create/Read/Update/Delete Jobs

Investigate jobs that have been run in the past (type of job, time spent, details)

Constraints

How many jobs will be run on the system per sec?

= # jobs/hour due to users + # jobs/hour due to machines

= 1m * 0.5 /day/24/3600 + 1m/50*20/24/3600

~= 12 jobs/sec

How much data will the system need to store?

Reasoning: I am only storing the job execution details, the actual work (script execution) is done > on other machines and some of the data collected is end time, success/failure status,etc. These are > all likely just text, maybe with graphing for illustration purpose. I will be storing the data of > > all jobs executed on in the system via the job scheduler (i.e. over the past 10 years)

= (Size of page where job details are set up + size of data collected about job ) * # of jobs * 365 > days * 10 years = 1 MB * 900 000 * 365 * 10

~= 3600 000 000 MB

= 3600 000 GB

=3600 TB =3.6 PB

Abstract Design

Based on the information above, we do not need to have too many machines to hold the data. I would break up the design into the following:

Application layer: serves the requests, shows UI details.

Data storage layer: Acts like a big hash table: Stores the mappings of key-value (key would be the jobs organized by dateTime they were run, while the values would show details of these jobs). This is to enable easy search of the historical and/or scheduled jobs.

The bottlenecks:

Traffic : 12 jobs/sec is not too challenging. If this spikes, we can use a load balancer to distribute the jobs to different servers for execution.

Data: At 3.6 TB, we need a hash table that can be easily queried for fast access to the jobs which have been executed in the application.

Scaling the abstract design

The nature of this job scheduler is that it each job possesses one of a few states: Pending, Failed,Success, Terminated. No business logic Returns little data.

For handling the traffic we can have an application server that handles 12 requests/sec and a backup in case this one fails. In future, we can use load balancer to reduce the number of requests going to each server (assuming >1 server are in production) Advantage of this would be to reduce number of requests/server, increase availability (in case one server fails, and handle spike-y traffic well).

For data storage, to store 3.6 TB of data we will need a few machines to hold it in database. We can use a noSQL db or SQL db. Given how the latter has more widespreaduse and community support which would help in troubleshooting matters and is used by large firms at the moment, I would choose mySQL db.

As the data grows, I would adopt the following strategies to handle it:

1) Create unique index on the hash

2) Scale mySQL db vertically by adding more memory

3) Partition the data by sharding

4) Use a master-slave replication strategy with master-master replication to ensure redundancy of data

Conclusion

Hence, this would be my design of the components of a job scheduler.

Answer

Michael Anderson picture Michael Anderson · Oct 15, 2014

Most large-scale job schedulers consider aspects not covered in your document.

Some of the key issues are: (in no particular order)

  • Cancellation - you often want to kill a long running job, or prevent one from running.
  • Priority - you often want high priority jobs to run in preference to low priority jobs. But implementing this in a way that low priority jobs don't wait forever in system where lots of jobs are generated is "non-trivial"
  • Resources - some jobs may only be schedulable on systems which have certain resources. E.g. some will require large amounts of memory, or fast local disk, or fast network access. Allocating these efficiently is tricky.
  • Dependencies - some jobs may only be runable once other jobs have completed, and thus can not be scheduled before a given time.
  • Deadlines - some jobs need to be completed by a given time. (or at least be started by a given time.)
  • Permissions - some users may only be able to submit jobs to certain resource groups, or with certain properties, or a certain number of jobs, etc.
  • Quotas - some systems give users a specified amount of system time, and running a job subtracts from that. This could make a significant difference to the numbers in your example.
  • Suspension - some systems allow jobs to be check-pointed and suspended and the resumed later.

I'm sure there's a heap more - try looking through the docs on slurm or grid-engine for more ideas.

Further things to consider:

  1. Your abstract design probably needs a bunch more detail to support these advanced notions.
  2. You don't need to access most of that 3.6TB of data frequently - split it into recent and old data and you'll have a much more managable database size if you allow access to old data to be slower (and hit the disk).
  3. You probably have different categories of users, at least "admins" and "users". What does this mean for the structure of the application.
  4. A real job scheduling application is capable of handling more requests per second - slurm suggests a sustained 33/second with higher bursts, but my understanding is that it can get significantly higher than this.
  5. It's common to need to submit jobs, or query job state through interfaces other than a web-page - what does this mean for the structure of your application. (I'd either use a simpler submition API for the core engine and have the web UI as a dumb translator to that, and all additional methods use the same API, or use a REST API with a simple web front end to that))
  6. How do you detect server failure? Is two servers enough to reliably determine this? It's common to use quorum based measures for this, or connectivity tests to a third server. What do you do if the failed server comes back online?