Which one to choose: Apache Oozie or Apache Airflow? Need a comparison

Vishal786btc · Dec 21, 2017 · Viewed 13.1k times

I am new to job schedulers and am looking for one to run jobs on a big data cluster. I am quite confused by the available choices. I found Oozie to have many limitations compared to established schedulers such as TWS, Autosys, etc.

Need some comparison points on Oozie vs. Airflow.

Appreciate your help.

Answer

Michele 'Ubik' De Simoni · Dec 21, 2017

In my experience, Airflow is the best data pipeline tool right now. It's best suited to managing complex, long-running workflows. Its UI and modularity are excellent.

Airflow

  • + Python Code for DAGs
  • + Has connectors for every major service/cloud provider
  • + More versatile
  • + Advanced metrics
  • + Better UI and API
  • + Capable of creating extremely complex workflows
  • + Jinja Templating
  • + Can be used as an orchestrator for the TensorFlow Extended ecosystem
  • = Can be parallelized
  • = Native connections to HDFS, Hive, Pig, etc.
  • = Graph as DAG

Oozie

  • --- Java or XML for DAGs
  • - hard to build complex pipelines
  • - smaller, less active community
  • - worse web GUI
  • - Java API
  • = Can be parallelized
  • = Native connections to HDFS, Hive, Pig, etc.
  • = Graph as DAG
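
Both lists share the "Graph as DAG" and "Can be parallelized" points. To make those concrete, here is a minimal sketch that uses only Python's standard library (`graphlib`, Python 3.9+). It is not Airflow or Oozie code. The task names and the `parallel_batches` helper are hypothetical, and a real scheduler adds retries, scheduling, and distributed workers on top of this idea.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_users": set(),
    "join": {"extract_orders", "extract_users"},
    "report": {"join"},
}


def parallel_batches(dag):
    """Group tasks into batches whose members have no unmet dependencies.

    Tasks inside one batch do not depend on each other, so a scheduler
    could run them concurrently.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock successors
    return batches


batches = parallel_batches(pipeline)
print(batches)
# [['extract_orders', 'extract_users'], ['join'], ['report']]
```

The two extract tasks land in the same batch because neither depends on the other. That is the kind of parallelism both tools can exploit once a workflow is expressed as a DAG.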

As you can see, Airflow is easier to use (especially in a large, heterogeneous team), more versatile, and more powerful than Oozie.

As I said: go with Airflow.
