DAG (directed acyclic graph) dynamic job scheduler

Alexandr Mazanov · Jan 12, 2013 · Viewed 18.1k times

I need to manage a large workflow of ETL tasks whose execution depends on time, data availability, or an external event. Some jobs may fail during execution of the workflow, and the system should be able to restart a failed workflow branch without waiting for the whole workflow to finish.

Are there any frameworks in Python that can handle this?

I see several core functions:

  • DAG building
  • Execution of nodes (run a shell command with wait, logging, etc.)
  • Ability to rebuild a sub-graph in the parent DAG during execution
  • Ability to manually execute nodes or a sub-graph while the parent graph is running
  • Suspend graph execution while waiting for an external event
  • List the job queue and job details

Something like Oozie, but more general-purpose and in Python.
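To make the core functions concrete, here is a minimal sketch of what I mean by dependency-ordered execution with restartable branches, in plain Python (the task names and commands are made up for illustration):

```python
import subprocess
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical ETL workflow: each node maps to a shell command,
# and the graph maps each task to the set of tasks it depends on.
commands = {
    "extract": "echo extracting",
    "transform": "echo transforming",
    "load": "echo loading",
}
graph = {"transform": {"extract"}, "load": {"transform"}}

def run_dag(graph, commands, done=None):
    """Run tasks in dependency order; skip tasks already marked done,
    so a failed branch can be restarted without re-running the rest."""
    done = set(done or ())
    failed = set()
    for task in TopologicalSorter(graph).static_order():
        if task in done:
            continue
        if failed & graph.get(task, set()):
            failed.add(task)  # an upstream task failed: skip this branch
            continue
        result = subprocess.run(commands[task], shell=True)
        (done if result.returncode == 0 else failed).add(task)
    return done, failed
```

Calling `run_dag(graph, commands)` executes all three tasks in order; calling it again with `done={"extract"}` re-runs only the remaining branch, which is the restart behavior I'm after.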

Answer

zfz · Apr 30, 2014

1) You can give dagobah a try. As described on its GitHub page: Dagobah is a simple dependency-based job scheduler written in Python. Dagobah allows you to schedule periodic jobs using cron syntax. Each job then kicks off a series of tasks (subprocesses) in an order defined by a dependency graph you can easily draw with click-and-drag in the web interface. This is the most lightweight of the four projects discussed here.

dagobah's web interface

2) In terms of ETL tasks, Luigi, which was open-sourced by Spotify, focuses more on Hadoop jobs. As described: Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in.

luigi's web interface

Both modules are written mainly in Python and include web interfaces for convenient management.

As far as I know, Luigi doesn't provide a cron-like scheduler module for triggering jobs, which I think is necessary for ETL tasks. But Luigi makes it easy to write map-reduce code in Python, and thousands of tasks at Spotify run on it every day.

3) Like Spotify, Pinterest open-sourced a workflow manager of its own, named Pinball. Pinball's architecture follows a master-worker (or master-client, to avoid naming confusion with a special type of client introduced below) paradigm, where the stateful central master acts as a source of truth about the current system state for stateless clients. It also integrates Hadoop/Hive/Spark jobs smoothly.

pinball's web interface

4) Airflow, yet another DAG job scheduling project, open-sourced by Airbnb, is quite similar to Luigi and Pinball. The backend is built on Flask, Celery, and so on. Judging from the example job code, Airflow seems both powerful and easy to use.

airflow's web interface

Last but not least: Luigi, Airflow, and Pinball are probably the most widely used of these. There is also a good comparison of the three: http://bytepawn.com/luigi-airflow-pinball.html